Artificial Intelligence-Aided Diagnosis Solution by Enhancing the Edge Features of Medical Images

Bone malignant tumors are metastatic and aggressive. The manual screening of medical images is time-consuming and laborious, and computer technology is now being introduced to aid in diagnosis. Due to a large amount of noise and blurred lesion edges in osteosarcoma MRI images, high-precision segmentation methods require large computational resources and are difficult to use in developing countries with limited conditions. Therefore, this study proposes an artificial intelligence-aided diagnosis scheme by enhancing image edge features. First, a threshold screening filter (TSF) was used to pre-screen the MRI images to filter redundant data. Then, a fast NLM algorithm was introduced for denoising. Finally, a segmentation method with edge enhancement (TBNet) was designed to segment the pre-processed images by fusing Transformer based on the UNet network. TBNet is based on skip-free connected U-Net and includes a channel-edge cross-fusion transformer and a segmentation method with a combined loss function. This solution optimizes diagnostic efficiency and solves the segmentation problem of blurred edges, providing more help and reference for doctors to diagnose osteosarcoma. The results based on more than 4000 osteosarcoma MRI images show that our proposed method has a good segmentation effect and performance, with Dice Similarity Coefficient (DSC) reaching 0.949, and show that other evaluation indexes such as Intersection of Union (IOU) and recall are better than other methods.


Introduction
Osteosarcoma is the most common solid tumor of bone origin, accounting for approximately 20% of primary sarcomas of bone [1]. Osteosarcomas frequently occur in adolescents or children under 20 years of age, with more than 75% of patients having an age of onset younger than 25 years [2]. Although some osteosarcomas can be cured by surgical means, some cases remain that are highly fatal even with the most aggressive treatment measures. With the application of comprehensive therapeutic approaches, a cure rate of 65-70% can be achieved in some patients. However, osteosarcoma is prone to lesions with long treatment cycles and poor prognosis, leading to high mortality rates of malignant tumors [3]. Patients with advanced osteosarcoma had a 5-year survival rate of only 20% [4]. The early detection of malignancies can significantly enhance disease cure rates and minimize patient death [5].
In the current application of evaluation of suspected osteosarcoma, radiographs are inaccurate in determining tumor borders, often resulting in results smaller than the actual size of the tumor [6]. Although CT shows the extent of bone destruction well, MRI performs better than CT in showing the extent of focal lesions [7]. MRI has the advantage of providing multidimensional images and allowing for more sensitive quantification of the extent of marrow cavity involvement [8]. Therefore, MRI images are very important for physicians to clinically examine patients with osteosarcoma.
The artificial intelligence-aided detection of medical images is important for predicting patient outcomes and monitoring disease progression with the development of treatment strategies [9][10][11]. Due to the general underdevelopment of health care systems, most countries, especially developing countries, still suffer from a strain and uneven distribution of health care resources [12]. Many hospitals have difficulty in meeting the hardware and staffing requirements for osteosarcoma treatment. Most diagnoses of osteosarcoma rely on the manual recognition of pictures [13]. However, the volume of image data from patients with osteosarcoma is enormous, but few images are of value. Only about 20 of the approximately 700 MRI images generated per patient with osteosarcoma may be useful to the physician for diagnosis [14,15]. The manual screening and processing of validated osteosarcoma images by physicians is a time-consuming and laborious process [16]. Considering the lack of uniformity in the diagnosis of histological features of osteosarcoma, it requires a high level of expertise and knowledge base in pathology on the part of the diagnosing physician. Diagnosis by inexperienced physicians is highly subjective, which can lead to an increased rate of misdiagnosis.
At present, image processing technology is developing rapidly, especially in image segmentation, and new methods are constantly proposed [17]. As medical image processing technology progress, more and more imaging modalities are being employed to diagnose osteosarcoma [18]. Analysis of the extent of tumor infiltration and the boundaries of tumor infiltration by MRI helps clinicians to localize the tumor. MRI images are susceptible to various noise sources (including thermal noise and physiological noise [19] caused by mechanical defects and external signals), which significantly reduces the image quality. The denoising operation is necessary to improve diagnostic efficiency. In addition, osteosarcoma itself has complex local tissue formation and morphological changes, difficult-to-maintain marginal features, and blurred tumor boundaries. Due to this, some medical image processing algorithms are less effective at identifying osteosarcomas, and it is challenging to obtain global multidirectional features and implicit features [20], as well as both accuracy and performance. It is crucial to improve the efficiency of osteosarcoma diagnosis by effectively extracting global features and solving the edge ambiguity segmentation problem without consuming too many computational resources and time costs. Edge feature-based methods are processing models employed in medical images, such as Transformer which has achieved wide application in the field of medical images by taking advantage of global modeling [20][21][22]. Such methods have achieved good results when dealing with simple tasks. When faced with the MRI segmentation task of osteosarcoma where there are a large number of complex boundaries and blurred lesion edges, it is difficult to acquire edge features globally and in multiple directions and to mine the implicit features, making the model accuracy and robustness poor [22]. Therefore, their segmentation results did not meet the expectations.
To improve the recognition accuracy of tumors, this study proposes an artificial intelligence-assisted diagnostic scheme for the MRI images of osteosarcoma with edgeenhanced features. It first processes the raw MRI images using threshold screening filtering (TSF) to filter out the redundant data in lesion-free regions. Then, the fast NLM algorithm and Fourier transform are fused to reduce the noise of the images. Finally, a segmentation network (TBNet) with edge-enhancement features is designed to improve recognition accuracy by enhancing the tumor edge features. It effectively alleviates the problem of the imprecise identification of fuzzy tumor boundaries by existing methods. The network introduces a channeled edge-crossing transformer instead of Unet's skip connection, and combines loss functions to globally segment tumor regions of different sizes at multiple scales, optimizing the effect of osteosarcoma segmentation.

Related Work
The segmentation of medical images by computer technology (including CT, MRI, X-ray, etc.), which in turn helps doctors to diagnose diseases, is gradually becoming a popular research topic. Listed and described below are some of the mainstream algorithms in the field.
Recently, visual transformers have been increasingly used for the segmentation of medical pictures. Jienengchen et al. proposed TransUNet [21], the first transformer-based segmentation method in medical imaging. The first transformer-based pure U-shaped design was Swin-Unet [22] proposed by Liu [26]. TransFuse network [27] was suggested by Yundong Zhang et al., which mixes transformers and CNNs in tandem, proposing a fusion of the two techniques, in contrast to most earlier efforts, which replaced the convolutional layers in U-Net networks with the transformer or cascaded the two. HiFormer, a hierarchical multiscale representation transformer was recently proposed by Moein Heidari et al. [28]. Chen Wei et al. presented that HRSTNet [29] replaces the convolutional layer with a transformer module that creates varied resolution feature mapping information. Ruina Sun et al. [30] introduced an effective image classification segmentation technique based on an enhanced Swin transformer, which was designed specifically for lung cancer classification and segmentation. UNeXt model [31]  In the disease diagnosis of osteosarcoma, image processing by computer technology as an aid to diagnosis has gradually become a research hotspot. The accurate classification of the morphology of different stages of osteosarcoma can enable the timely control of the spread and treatment of osteosarcoma patients in early diagnosis. In the article [33], a CNN has been used to pre-train a publicly available dataset of osteosarcoma tissue images to avoid extensive metastasis. Yu Fu [34] developed a DS-Net network capable of automatically classifying histological images. Hosein , a new convolutional neural network structure with multiple tandem CNNs. Rahad Arman Nabid [36] proposed a convolutional network to evaluate the grading of patients with osteosarcoma, but the model suffers from the problem of simple overfitting. To identify osteosarcoma cells from osteoblastic cells (MSC), the method proposed by Mario D'Acunto et al. [37] has an accuracy close to one and allows for the study of single cells but requires a huge amount of cellular data. Parlak et al. [38] employed diffusion-weighted imaging to accurately show Ewing's sarcoma and osteogenic sarcoma by measuring the expression diffusion coefficient (ADC) values for borderline case segmentation of Ewing's sarcoma and osteogenic sarcoma.
Many scholars in the literature have proposed the application of many image-processing techniques for the early identification and prediction of treatment options for patients with osteosarcoma. Su Young Jeong et al. [39] proposed the use of machine learning methods incorporating baseline 18-FDG positron emission tomography to predict textural features in scanned images. To this end, Hyung-Jun Im et al. [40] proposed various segmentation methods for pseudo-myelinating lesions, including the relative background threshold method, the gradient-based method (PETedge), and Bluse Otsu (MO-PET). Shuai Limei et al. proposed W-net+ [41], a network structure based on a dense jump connection structure and cascaded dual U-Net. WB Huang et al. [42] proposed a fully automated MRI method for osteosarcoma detection. It is used to identify tumors with irregular structure and shape by using conditional random fields.
The above analysis demonstrates that picture segmentation techniques are becoming increasingly significant in illness diagnosis and prognosis assessment. However, because the pictures are vulnerable to noise, edge features in image segmentation of osteosarcoma are still challenging to retain and segmentation accuracy needs to be improved. In this study, an artificial intelligence-aided diagnosis method for osteosarcoma with edge-enhancement features is proposed, which improves the accuracy of tumor recognition by enhancing edge information.

Methodology
With the development of computer image processing technology, artificial intelligence solutions are widely used in the medical field. Its importance is to solve the problems of high consumption of medical resources and low efficiency of disease diagnosis in developing countries [43][44][45][46]. However, such methods still present major challenges in the recognition of tumors. Taking osteosarcoma MRI images as an example, existing methods have difficulty in handling images with complex lesions and blurred edges [10,47]. This paper proposes an artificial intelligence-aided diagnosis scheme to provide more options for osteosarcoma-aided diagnosis in developing countries. The overall design of the scheme is in Figure 1. First, we perform MRI image pre-screening by Threshold Screening Filter (TSF) to filter redundant data and extract useful images for submission to the next step; then, we introduce a combined fast Fourier transform NLM algorithm for noise reduction in MRI images to improve the diagnostic quality. Finally, to improve the recognition accuracy of the model for tumor edges, a U-shaped network with a fused transformer featuring edge enhancement is used to segment the pre-processed images. The channeled transformer bridges the semantic gap that exists in the UNet network. The method also introduces the edge-enhancement module (BAB) with a combined loss function to optimize the effect of edge segmentation.  [41], a network structure based on a dense jump connection structure and cascaded dual U-Net. WB Huang et al. [42] proposed a fully automated MRI method for osteosarcoma detection. It is used to identify tumors with irregular structure and shape by using conditional random fields. The above analysis demonstrates that picture segmentation techniques are becoming increasingly significant in illness diagnosis and prognosis assessment. However, because the pictures are vulnerable to noise, edge features in image segmentation of osteosarcoma are still challenging to retain and segmentation accuracy needs to be improved. In this study, an artificial intelligence-aided diagnosis method for osteosarcoma with edge-enhancement features is proposed, which improves the accuracy of tumor recognition by enhancing edge information.

Methodology
With the development of computer image processing technology, artificial intelligence solutions are widely used in the medical field. Its importance is to solve the problems of high consumption of medical resources and low efficiency of disease diagnosis in developing countries [43][44][45][46]. However, such methods still present major challenges in the recognition of tumors. Taking osteosarcoma MRI images as an example, existing methods have difficulty in handling images with complex lesions and blurred edges [10,47]. This paper proposes an artificial intelligence-aided diagnosis scheme to provide more options for osteosarcoma-aided diagnosis in developing countries. The overall design of the scheme is in Figure 1. First, we perform MRI image pre-screening by Threshold Screening Filter (TSF) to filter redundant data and extract useful images for submission to the next step; then, we introduce a combined fast Fourier transform NLM algorithm for noise reduction in MRI images to improve the diagnostic quality. Finally, to improve the recognition accuracy of the model for tumor edges, a U-shaped network with a fused transformer featuring edge enhancement is used to segment the pre-processed images. The channeled transformer bridges the semantic gap that exists in the UNet network. The method also introduces the edge-enhancement module (BAB) with a combined loss function to optimize the effect of edge segmentation.  The artificial intelligence-aided diagnosis scheme of the osteosarcoma MRI images we constructed is divided into two main sections: Section 3.1 is data preprocessing; Section 3.2 is the image segmentation model of osteosarcoma MRI. In Section 3.1, we are mainly divided into two parts, which are the pre-screening operation based on Threshold Screening Filter (TSF) and the noise-reduction processing based on a fast NLM. Table 1 shows the main mathematical symbols of this chapter and their annotations. The modified Bessel function η 1 , η 2 Real and virtual channel impact k(a) The activation function of TSF The function of weighted similarity The DCT transform operator (FFT) and its inverse Z(i) The normalization constant φ(·) The boundary level set The similarity matrix and Mask edge diagram ψ(·) The instance normalization δ(·) The ReLU operator f i−1 Supplementary layer feature map for layer i − 1 g(·) The probabilistic output of the network The purpose of pre-screening is to obtain some valuable images from the original MRI images. For MRI image filtering, we used a threshold filter (TSF).
where λ T is a hyperparameter of the TSF controlling the screening threshold of the lesion. Noise significantly reduces the accuracy and there is no greater need for edge details during pre-screening for the time being. Therefore, we introduced the discrete Fourier transform (DFT) [48] for coarse initial denoising of noise at high frequencies first. B a is the coarse initial noise reduction method for B a .
where F 2D represents the DCT transform operator, F −1 2D represents the inverse of F 2D , and λ FFT2 is the fixed threshold parameter. Additionally, s is defined as: In Equation (1), k is: If the output of k(a) is 1, the location of the window block B a contains the lesion area, and the image is kept to mark the image as "lesion image". When it is not 1, the image of window block B a is considered invalid and the sliding traversal of the entire image continues. After the original data set is pre-screened, the redundant data without lesions are eliminated, and the useful and valuable image data are retained for further operations.

Noise Reduction Based on Fast NLM Algorithm
The degree of tumor invasion and the invasion boundary is different, which affects the recognition effect of the model to a certain extent. The source of noise influence in MRI images is different from the usual images. The usual images are perturbed by Gaussian white noise. MRI images, on the other hand, are influenced by the bias induced by the Rician distribution noise associated with the MR acquisition system signal, as well as Gaussian white noise. Real and imaginary channels are influenced by and collect complex MR data X.
where S is the original MR image and θ is the phase. Changing the noise distribution to Rician can be expressed as: By first squaring the magnitude of the data X and then considering the expectation of both sides, it is possible to determine the bias caused by the Rician distribution.
The non-local mean (NLM) takes advantage of the redundancy of the image, and the pixels in the image filtered by non-local averaging are a weighted average of all other pixels [7]. NLM can successfully remove the aforementioned image noise. This number is determined by examining the picture block and determining the sliding window's similarity to the selected window. The NLM method's core equation is Equation (7).
is the weight to calculate the similarity between two patches, h represents the filtering parameter, and p represents the Euclidean distance between two patches. The original NLM algorithm takes a lot of time to compute the similarity weights and has a high distance computation complexity [49]. To keep the NLM accurate while reducing the consumption of computational resources, we introduce the Fast Fourier Transform (FFT) strategy to implement a fast algorithm for computing pixel-level NLM. v is the input noisy image, V is the patch, c s is the patch edge half-length, c is the patch side length as c = 2 × c s + 1, and c 2 is the number of pixels in the patch procedure. First, we rearrange the loop to consider all pixels i of all translation vectors t ∈ [[−C s , +C s ]] 2 . In the context of patches distance, the Euclidean distance can be calculated as follows: Then, in this case, the weighted norm of the patch difference is a discrete convolution: where * denotes discrete convolution, K(b) = K(−b). It is calculated by F and its inverse F −1 . The time complexity of FFT is O ND 2 log(N) , subject to the computation of any translation vector t ∈ [[−C s , +C s ]] 2 , and the computation of NLM weights is independent of the patch size. Assuming that the total number of image pixels is N, the introduction of FFT enables the NLM algorithm to obtain a good speedup. Specifically, the time complexity is reduced from the original O ND 2 c 2 to O(ND 2 log(N). As shown in Figure 1, after the image data are processed by noise reduction, they are then put into the subsequent segmentation operation.

Tumor Localization
The original data are input to the segmentation model TBNet for segmentation after a series of data preprocessing, such as image pre-screening and noise reduction. TBNet is a U-Net network with edge-enhanced features of a fusion transformer designed by us, which can effectively solve the edge blur segmentation problem caused by blurred lesion edges. TBNet is based on the UNet network without skip connection, introducing a multi-head cross-fusion transformer (MCT), edge-enhanced cross-attention module (ECA) with combined loss function; this optimizes the segmentation effect and solves the edge blur segmentation problem. The model mainly consists of U-Net without a skip-connection mechanism [50], channel edge cross-fusion transformer (including multi-head cross-fusion transformer module MCT and edge-enhanced cross-attention module ECA), and combined loss function. Figure 2 shows the general design of the model.
The time complexity of FFT is ( 2 ( )), subject to the computation of any translation vector ∈ [[− , + ]] 2 , and the computation of NLM weights is independent of the patch size. Assuming that the total number of image pixels is N, the introduction of FFT enables the NLM algorithm to obtain a good speedup. Specifically, the time complexity is reduced from the original ( 2 2 ) to ( 2 ( ). As shown in Figure 1, after the image data are processed by noise reduction, they are then put into the subsequent segmentation operation.

Tumor Localization
The original data are input to the segmentation model TBNet for segmentation after a series of data preprocessing, such as image pre-screening and noise reduction. TBNet is a U-Net network with edge-enhanced features of a fusion transformer designed by us, which can effectively solve the edge blur segmentation problem caused by blurred lesion edges. TBNet is based on the UNet network without skip connection, introducing a multihead cross-fusion transformer (MCT), edge-enhanced cross-attention module (ECA) with combined loss function; this optimizes the segmentation effect and solves the edge blur segmentation problem. The model mainly consists of U-Net without a skip-connection mechanism [50], channel edge cross-fusion transformer (including multi-head cross-fusion transformer module MCT and edge-enhanced cross-attention module ECA), and combined loss function. Figure 2 shows the general design of the model. The original U-net does not work well in the osteosarcoma diagnosis task. Osteosarcoma itself has complex morphological changes, difficult-to-maintain edge features, and blurred tumor boundaries, coupled with the fact that MRI images, the tools relied on for diagnosis, are vulnerable to noise. This makes the skip connection mechanism have a certain semantic gap in the osteosarcoma segmentation recognition task, makes the stage features incompatible, and makes it difficult to obtain edge features and mine the implicit features globally and in multiple directions, thus making the model accuracy and robustness poor with a certain impact on the segmentation. We introduce a cross-fusion channel transformer with edge-enhancement features to replace the jump connection while taking advantage of the transformer and Unet for the cross-fusion of multi-scale channel information, solving the problem of semantic hierarchical inconsistency, and enhancing edge information to improve the edge segmentation ambiguity problem. The original U-net does not work well in the osteosarcoma diagnosis task. Osteosarcoma itself has complex morphological changes, difficult-to-maintain edge features, and blurred tumor boundaries, coupled with the fact that MRI images, the tools relied on for diagnosis, are vulnerable to noise. This makes the skip connection mechanism have a certain semantic gap in the osteosarcoma segmentation recognition task, makes the stage features incompatible, and makes it difficult to obtain edge features and mine the implicit features globally and in multiple directions, thus making the model accuracy and robustness poor with a certain impact on the segmentation. We introduce a cross-fusion channel transformer with edge-enhancement features to replace the jump connection while taking advantage of the transformer and Unet for the cross-fusion of multi-scale channel information, solving the problem of semantic hierarchical inconsistency, and enhancing edge information to improve the edge segmentation ambiguity problem. The channelized edge cross-fusion transformer consists of MCT for encoder feature transformation, MCT for encoder feature conversion, and the edge-enhancement crossattention module (ECA) for decoder feature fusion.
(1) Multi-head cross-fusion transformer (MCT) for encoder feature transformation MCT is a multi-scale global feature investigation of osteosarcoma MRI images using transformer long-dependent modeling to fuse multi-scale encoder features. It is divided into three stages: Embedding Multi-scale Features, Multi-Crossing Attention, and Multi-Layer Perceptron. Figure 3 shows the design of the MCT. (1) Multi-head cross-fusion transformer (MCT) for encoder feature transformation MCT is a multi-scale global feature investigation of osteosarcoma MRI images usin transformer long-dependent modeling to fuse multi-scale encoder features. It is divide into three stages: Embedding Multi-scale Features, Multi-Crossing Attention, and Mult Layer Perceptron. Figure 3 shows the design of the MCT. Embedding Multi-Scale Features. The embedding of multi-scale features starts b labeling the features with the of the multi-scale feature of the given osteosarcoma MR image, and reshaping the 2D patch of the flat feature sequence so that the patch size is 2 ⁄ , 4 ⁄ , and 8 ⁄ . In this process, the channel size remains the same. We then conne the four layers as keys and values = ( 1 , 2 , 3 , 4 ) and feed them into a mult headed cross-attention module to encode the channels and dependencies to refine the fe tures at the encoder level of each osteosarcoma MRI image using multiscale features.
Multi-Crossing Attention. As shown in Figure 4, there are five inputs required fo this module. Let × , × , × , an × , × , × be the weights of different inputs; is the s quence length, ( = 1,2,3,4) is the channel size to skip the concatenated layer, and th four sizes we use are 1 = 64 , 2 = 128 , 3 = 256 , 4 = 512. Through the cross-noti mechanism, a similarity matrix is generated and is weighted.
where (•) and (•) denote the instance normalization and the softmax function, respe tively. We change the original method of the attention operation along the patch-axis the self-attention mechanism and change the direction to along the channel direction. In stance normalization is also used to normalize each instance's similarity matrix on th similar mapping, allowing for the gradient to flow smoothly. In the case of N-head focu = ∑ 0< < , ∈ (1 Embedding Multi-Scale Features. The embedding of multi-scale features starts by labeling the features with the I i of the multi-scale feature of the given osteosarcoma MRI image, and reshaping the 2D patch of the flat feature sequence so that the patch size is P, P/2, P/4, and P/8. In this process, the channel size remains the same. We then connect the four layers as keys and values L Σ = Concat(L 1 , L 2 , L 3 , L 4 ) and feed them into a multiheaded cross-attention module to encode the channels and dependencies to refine the features at the encoder level of each osteosarcoma MRI image using multiscale features.
Multi-Crossing Attention. As shown in Figure 4, there are five inputs required for this module. Let T i R C i ×d , T Key R C Σ ×d , T Value R C Σ ×d , and W i R C i ×d , W Key R C Σ ×d , W Value R C Σ ×d be the weights of different inputs; d is the sequence length, C i (i = 1, 2, 3, 4) is the channel size to skip the concatenated layer, and the four sizes we use are C 1 = 64, C 2 = 128, C 3 = 256, C 4 = 512. Through the cross-notice mechanism, a similarity matrix M i is generated and T Value is weighted.
where ψ(·) and σ(·) denote the instance normalization and the softmax function, respectively. We change the original method of the attention operation along the patch-axis in the self-attention mechanism and change the direction to along the channel direction. Instance normalization is also used to normalize each instance's similarity matrix on the similar mapping, allowing for the gradient to flow smoothly. In the case of N-head focus: Multi-Layer Perceptron (MLP). After the multiple cross-notice mechanisms, the results are fed into the MLP with the residual structure and residual operator to obtain the output. An L-layer transformer is constructed by repeating the operation equation (14) Diagnostics 2023, 13, x FOR PEER REVIEW 9 of 21

Multi-Layer Perceptron (MLP).
After the multiple cross-notice mechanisms, the results are fed into the MLP with the residual structure and residual operator to obtain the output. An L-layer transformer is constructed by repeating the operation equation (14) L times. The four outputs 1 , 2 , 3 , 4 of the L-th layer are reconstructed by up-sampling the operations after convolutional layers. They are connected to the decoder features 1 , 2 , 3 , 4 , respectively, and fed to the Edge-Enhanced Cross-Attention (ECA).
The segmentation of the focus area in the MRI image is a binary classification task. The introduction of too many layers of transformers will lead to the explosive growth of computational complexity and model over-fitting problems. After balancing resource cost and segmentation accuracy, N and L are set to 4.
(2) Edge-Enhanced Cross Attention (ECA) module for decoder feature fusion As shown in Figure 5, we propose a channel-based edge-enhancement cross-attention module to better fuse the semantic inconsistency characteristics between the channel transformer and the U-Net decoder, as well as to correct for fuzzy edge segmentation and the lack of some regions in osteosarcoma MRI images. The module is split into two sections: Decoder Channel Crossing Attention Module and Edge-Enhancement Module. It can not only direct the channel and information filtering of interceptor features to minimize ambiguity in decoder features, but it can also increase edge information to replenish missing regions.  The segmentation of the focus area in the MRI image is a binary classification task. The introduction of too many layers of transformers will lead to the explosive growth of computational complexity and model over-fitting problems. After balancing resource cost and segmentation accuracy, N and L are set to 4.
(2) Edge-Enhanced Cross Attention (ECA) module for decoder feature fusion As shown in Figure 5, we propose a channel-based edge-enhancement cross-attention module to better fuse the semantic inconsistency characteristics between the channel transformer and the U-Net decoder, as well as to correct for fuzzy edge segmentation and the lack of some regions in osteosarcoma MRI images. The module is split into two sections: Decoder Channel Crossing Attention Module and Edge-Enhancement Module. It can not only direct the channel and information filtering of interceptor features to minimize ambiguity in decoder features, but it can also increase edge information to replenish missing regions.

Multi-Layer Perceptron (MLP).
After the multiple cross-notice mechanisms, the re sults are fed into the MLP with the residual structure and residual operator to obtain th output. An L-layer transformer is constructed by repeating the operation equation (14) times. The four outputs 1 , 2 , 3 , 4 of the L-th layer are reconstructed by up-samplin the operations after convolutional layers. They are connected to the decoder feature 1 , 2 , 3 , 4 , respectively, and fed to the Edge-Enhanced Cross-Attention (ECA).
The segmentation of the focus area in the MRI image is a binary classification task The introduction of too many layers of transformers will lead to the explosive growth o computational complexity and model over-fitting problems. After balancing resource co and segmentation accuracy, N and L are set to 4.
(2) Edge-Enhanced Cross Attention (ECA) module for decoder feature fusion As shown in Figure 5, we propose a channel-based edge-enhancement cross-atten tion module to better fuse the semantic inconsistency characteristics between the channe transformer and the U-Net decoder, as well as to correct for fuzzy edge segmentation an the lack of some regions in osteosarcoma MRI images. The module is split into two se tions: Decoder Channel Crossing Attention Module and Edge-Enhancement Module. can not only direct the channel and information filtering of interceptor features to min mize ambiguity in decoder features, but it can also increase edge information to replenis missing regions.

Decoder Channel Crossing Attention Module.
We take level i transformer output O i ∈ R C×H×W and F i ∈ R C×H×W as the input for cross-attention. Spatial squeezing is performed through the GAP layer to construct attention masks.
where the resulting vector ξ(x) ∈ R C×1×1 , the k-th channel ξ(x) = 1 H×W ∑ H i=1 ∑ W j=1 X k (i, j), x 1 ∈ R C×C and x 2 ∈ R C×C . We recalibrate and excite O i .
Edge-Enhancement Module. The blurred focus edge in MRI images often leads to inaccurate tumor segmentation. To solve the problem of edge fuzzy segmentation, the edge-enhancement module is introduced into the cross-attention module of the feature fusion part of the encoder.
For the edge attention mechanism, we are inspired by the spatial channel compression and stimulus attention module and designed the attention module that focuses more on the edge feature information with the same simplicity and without increasing the parameters. The specific process can be expressed as follows: first, the input feature map F ab ∈ R C×H×W is compressed in channel and space, and the compressed feature map F a ∈ R H×W×1 obtained after the compression is multiplied with the vector F b ∈ R 1×1×C to obtain a weight W B ∈ R C×H×W of the same size as the input. This not only provides a corresponding weight for each pixel of the input feature map but can also highlight more important location information at the edges and suppress location information of a small value.
where × represents the expansion to direct multiplication and represents the pixel-bypixel multiplication. Finally, the output O i of the decoder channel cross-attention module and the output F i of the edge-enhancement module will converge to obtain the final output F i . F i will continue to work as an input to the i − 1 layer. In addition, the following loss function is proposed.
It combines dice loss and boundary loss: where P is the predicted value, T denotes the true label value, G is Ground Truth, and g(·) is the softmax probability output of the network. If i ∈ G, then φ G (i) is the negative value of the distance between the point and G; conversely, it has a positive value. We introduce parameters α and β as balancing coefficients in the defining Equation (19) of the integrated loss function L. By focusing on both the tumor region and edge information, our method effectively alleviates the edge-blurring problem that has not been addressed in previous research methods.

Dataset
The data for this article are from the Research Center for Artificial Intelligence of Monash University [51]. The original dataset included more than 4000 MRI image samples from 204 patients with osteosarcoma. All patients are diagnosed by specialists at the hospital based on clinical presentation and pathological images. All images were rotated by 90 • , 180 • , and 270 • , respectively, to improve the generalization ability of the model. Finally, the dataset was partitioned into a training set, a validation set, and a test set in a ratio of 7:1:2. In addition, the ground-truth label of our MRI images was carried out by a collaboration of three doctors at the hospital. As shown in Figure 1, in this mission, only the focal area needs to be segmented. Therefore, the ground-truth label consists of two parts-the tumor area is the foreground and the other part is the background.
The experiments in this paper were conducted under the Ubuntu 18.04.5 operating system with Python as the main programming language, Python version 3.8, and PyTorch version 1.10.0. The GPU used in the experiments was a GTX 3060, and the CPU used was an Intel(R) Xeon(R) E5-2630L v3 with 30 GB of RAM.

Evaluation Metrics
Based on an earlier study, classification accuracy (CA) expresses the ratio between the screened accurate samples and the expected samples [52]. We used CA to assess the validity of the MRI image pre-screening algorithm. Let C R be the correctly classified image, i.e., the lesioned image that is correctly screened out by the normal image, C F be the incorrect classification result, and C R + C F be denoted as all the images of the input. The overall error of this process can be denoted by C F .
The peak signal-to-noise ratio (PSNR) can objectively quantify the denoising effect. NMISE represents the normalized mean squared error of integration between the denoised and noise-free data [53].
Intersection of Union (IOU) is a metric that is commonly used to measure the similarity between the predicted tumor region and the actual tumor region [54]. Specifically, IOU is the ratio of the intersection region between the judged tumor region I 1 and the real tumor region I 2 . The larger this index is (the closer it is to 1), the greater the impact of the segmentation.
DSC is commonly used to indicate segmentation effects on a similarity measure from 0 to 1 [55]. The DSC is calculated as the ratio of twice the area of the junction of I 1 and I 2 to the entire area, as expressed in the following equation. The best segmentation effect is obtained when the DSC is 1.
The segmentation network's performance is explained using a confusion matrix. Among them, TP indicates that the area is predicted and is actually a focus area. TN indicates that both predicted and actual tissues are normal. FP represents the predicted tumor area, which is actually normal tissue. On the contrary, FN indicates that it is predicted to be normal, but it is actually a tumor. In this experiment, accuracy (ACC), precision (Pre), recall (Re), and F1-score (F1) are calculated by using the confusion matrix to measure the performance of each network [56].

Algorithm Comparison
We perform a comparative experimental analysis with our proposed TBNet using the following method. These methods are briefly described below.
(1) A fully convolutional network (FCN) is a pixel-level classification of images, using skip structures to achieve fine segmentation [57]. In this paper, 2 networks with 8 and 16 up-sampling are used FCN-8s and FCN-16s.
(2) The PSPNet focuses on the pyramid pool module as a technique of extracting global contextual information, collecting and fusing contextual information at various scales, thus being particularly effective in acquiring global information [58]. (3) The MSFCN is a fully convolutional network with many supervised lateral output layers for automatic volume segmentation [59]. To enable the effective learning of local and global visual features, a supervised lateral layer is added to the three layers of the convolutional network to provide a system-level structure to guide multidimensional feature learning. (4) Multi-scale residual networks (MSRN) [60] can adaptively identify image features at different scales and make them interact to produce effective high-resolution picture information. The generation of structural multiscale residual block MSRBs is achieved by combining convolution kernels of different sizes on the basis of the creation of residual blocks. (5) U-Net is a symmetric U-shaped structure that enables image-semantic-level segmentation [61]. The left side is a convolutional layer (systolic path) responsible for feature extraction and the right side is an up-sampling layer (extended path) responsible for feature reduction. It can use very little data to obtain the best results. (6) Feature pyramid networks (FPN) [62] use low-level feature semantic information with accurate target locations and information-rich high-level feature semantic information to make predictions independently at different feature layers using multi-scale feature fusion. (7) The UTNet network [24] hybrid transformer design incorporates self-attention into convolutional neural networks. The self-attention module is used in both the encoder and decoder in UTNet, and it is combined with relative position coding to considerably minimize the complexity of the self-attention process. In addition, a self-attentive decoder is presented to recover fine-grained features from the encoder's skipped connections, addressing the problem that the transformer requires a considerable quantity of data to acquire the visual sensing bias. (8) The TransUNet network [21] combines the advantages of a transformer and U-Net in a hybrid CNN transformer design. The global context extraction encodes the tokenized picture blocks in the CNN feature map as input sequences by the transformer. The decoder performs up-sampling to achieve accurate localization. This operation is performed before combining the encoded features with the high-resolution CNN feature map.

Influence of Super Parameters
In the preparation before training, to improve the efficiency of training and model performance, we first performed the pre-screening and noise-reduction operation of the data to avoid meaningless redundant and invalid data, which affect the improvement in segmentation efficiency. For the fixed parameters of the TSF in the pre-screening operation, referring to the general settings of the study, we set n 1 to 7 and the fixed threshold parameter λ FFT2 to 0.82.
In addition, for the setting of parameter λ T in TSF, we found that the reasonable setting interval of λ T should be 90-160 by repeating the experiment and statistically analyzing the lesion areas on the MRI images of different osteosarcomas. If λ T is set too small or too large, the accuracy of screening is reduced. The former is because it causes the TSF to incorrectly classify lesioned images as normal, reducing the accuracy of the screening process. The latter is because the TSF will incorrectly label healthy images as lesions. The average classification accuracy is maximum when 120 < λ T < 140, which may reach 95.82%, and the optimal interval of λ T can be defined as 120-140. The trend of classification accuracy (CA) can be shown in Figure 6 for a suitable range of parameters. Diagnostics 2023, 13, x FOR PEER REVIEW 13 of 21 Figure 6. Trend diagram of CA with λ T .

Results
From the MRI images screened by the TSF algorithm (the lesion images used for the experiment and the rejected normal images), 100 images were selected separately and the accuracy of the model was judged by the confusion matrix, as shown in Table 2. The predicted values indicate the results judged by the TSF algorithm. The classification results judged by the physician after examining the MRI images are represented by the actual values. Among them, the model correctly classified 93 normal images and lesion images, judged 3 lesion images as normal, and judged 4 normal images as lesion images. According to the analysis in Section 4.4, the best CA value of the TSF algorithm reached above 0.95. The CA value of the randomly selected results in Table 2 reached 0.93. It proves that the TSF algorithm is as expected and can effectively eliminate redundant data from lesionfree regions and filter out valuable training images. To verify the effectiveness of the denoising process, we used the evaluation metric PSNR to quantify the validity of this operation. Figure 7 shows the reconstruction results of the images before and after noise reduction with the corresponding PSNR values. By comparison, it can be observed that the post-noise-reduction image (Figure 7b) is clearer than the image before the noise-reduction treatment (Figure 7a) and is very close to the noise-free image (Figure 7c). Meanwhile, we can know that the PNSR value will be higher after the noise-reduction processing.
Specifically, Figure 8 shows the effect comparison of some randomly selected images before and after the noise-reduction process to prove that noise reduction has a great impact on the recognition accuracy of tumors. Comparing the segmentation effect before noise reduction ( Figure 8B) and after noise reduction ( Figure 8C), the segmentation effect is similar in the global view. However, we can visually see from the details of the local area magnification that the segmentation effect after noise reduction ( Figure 8C) compensates for the detailed contour of the edge area of the lesion compared with the segmentation effect without noise reduction ( Figure 8B), which is closer to the real image. The results demonstrate that noise-reduction processing increases accuracy.

Results
From the MRI images screened by the TSF algorithm (the lesion images used for the experiment and the rejected normal images), 100 images were selected separately and the accuracy of the model was judged by the confusion matrix, as shown in Table 2. The predicted values indicate the results judged by the TSF algorithm. The classification results judged by the physician after examining the MRI images are represented by the actual values. Among them, the model correctly classified 93 normal images and lesion images, judged 3 lesion images as normal, and judged 4 normal images as lesion images. According to the analysis in Section 4.4, the best CA value of the TSF algorithm reached above 0.95. The CA value of the randomly selected results in Table 2 reached 0.93. It proves that the TSF algorithm is as expected and can effectively eliminate redundant data from lesion-free regions and filter out valuable training images. Normal image 3 44 To verify the effectiveness of the denoising process, we used the evaluation metric PSNR to quantify the validity of this operation. Figure 7 shows the reconstruction results of the images before and after noise reduction with the corresponding PSNR values. By comparison, it can be observed that the post-noise-reduction image (Figure 7b) is clearer than the image before the noise-reduction treatment (Figure 7a) and is very close to the noise-free image (Figure 7c). Meanwhile, we can know that the PNSR value will be higher after the noise-reduction processing.
Specifically, Figure 8 shows the effect comparison of some randomly selected images before and after the noise-reduction process to prove that noise reduction has a great impact on the recognition accuracy of tumors. Comparing the segmentation effect before noise reduction ( Figure 8B) and after noise reduction ( Figure 8C), the segmentation effect is similar in the global view. However, we can visually see from the details of the local area magnification that the segmentation effect after noise reduction ( Figure 8C) compensates for the detailed contour of the edge area of the lesion compared with the segmentation effect without noise reduction ( Figure 8B), which is closer to the real image. The results demonstrate that noise-reduction processing increases accuracy.     In Table 3, we further visualize the effect of the TSF filtering algorithm (λ T = 133) and the NLM noise-reduction algorithm on the performance of the model. When the original MRI images were processed directly by the TBNet network, all the other indexes performed poorly, although its Re value reached 0.974. When images without tumor regions (or tumor regions barely visible) were removed using the TSF method, the performance of our model improved more significantly. Precision increased by around 0.009 after image pre-screening, F1 increased by about 0.003, recall increased by about 0.001, and DSC increased by about 0.006. The most important DSC index improved by roughly 0.012 after the noise-reduction treatment, whereas precision, F1, and IOU improved by 0.005, 0.002, and 0.011, respectively. After the screening of the dataset by Threshold Screening Filter (TSF), more than 3000 MRI images remained. All comparison methods, including UNet, were experimented with using the pre-processed images to ensure fairness. The results of each model for the segmentation of osteosarcoma MRI images are shown in Figure 9 below. We can visually examine the model's segmentation performance using ground-truth photographs. Meanwhile, we have selected the DSC metrics, and based on the following six segmentation examples of osteosarcoma, we can find that the TBNet network can obtain better segmentation results. When dealing with simple segmentation jobs, our technique can obtain the same segmentation outcomes as previous models. It performs better when dealing with tumor segmentation jobs with complex boundaries (cases 3 and 4). Our method, particularly in the specifics of the border contour of the osteosarcoma lesion, can obtain a more precise and comprehensive segmentation of the tumor borders with higher accuracy when compared to other methods that can handle it appropriately. Similarly, for the same osteosarcoma images with complex edge blurring characteristics as in cases 3 and 4 above, we selected the following six examples of osteosarcoma images with blurred edges and used TransUNet, and our proposed TBNet for image segmentation, respectively. As shown in Figure 10, our method can segment the boundary of the lesion region more effectively and precisely. Similarly, for the same osteosarcoma images with complex edge blurring characteristics as in cases 3 and 4 above, we selected the following six examples of osteosarcoma images with blurred edges and used TransUNet, and our proposed TBNet for image segmentation, respectively. As shown in Figure 10, our method can segment the boundary of the lesion region more effectively and precisely. Similarly, for the same osteosarcoma images with complex edge blurring characte istics as in cases 3 and 4 above, we selected the following six examples of osteosarcom images with blurred edges and used TransUNet, and our proposed TBNet for image se mentation, respectively. As shown in Figure 10, our method can segment the boundary the lesion region more effectively and precisely. We measured segmented data to quantitatively compare the effectiveness of the va ious methods by comparing several classical comparative metrics. The osteosarcoma ta is shown in Table 4. Our proposed model TBNet performs well overall in the task. Sever metrics of our scheme are the highest. For example, ACC, IOU, DSC, and Re reach 0.99 0.915, 0.949, and 0.969, respectively. The performance of the UNet network is al We measured segmented data to quantitatively compare the effectiveness of the various methods by comparing several classical comparative metrics. The osteosarcoma task is shown in Table 4. Our proposed model TBNet performs well overall in the task. Several metrics of our scheme are the highest. For example, ACC, IOU, DSC, and Re reach 0.997, 0.915, 0.949, and 0.969, respectively. The performance of the UNet network is also relatively good with an ACC value of 0.99. The MSRN network has the least params among all methods with only 12.42 M, and it also has a recall of 0.945. Comparatively, although the MSRN network also requires only 24.53 M parameters, it has significantly lower DSC values. For the PSPNet algorithm, it has the lowest FLOPS value but it has the worst IOU and DSC values. It indicates that the performance of this model is poor. Although our method has more parameters than UNet, MSFCN, and MSRN models, it has fewer parameters than some other convolution-based models. It has higher segmentation accuracy and faster computation speed with only a small number of additional parameters. Each model was trained for a total of 200 rounds, and 1 round was selected every 4 rounds for visualization. As can be seen from Figure 11, our method is the fastest to reach the best stability in terms of accuracy, recall, and F1 value. In Figure 11a, the accuracy ranking among the models is ours > TransUNet > Unet > FPN > MSRN > MSFCN. In Figure 11b, overall, the recall rate of our suggested method has been kept as high as possible, ensuring that missed diagnoses are avoided to the greatest extent possible. Finally, the F1-score of each model in the training process is compared with our method. From Figure 11c, we can see that our F1 value fluctuates to some extent, but its average value remains the highest. It can be seen that the model is more robust and accurate in processing MRI images of osteosarcoma with different morphological characteristics. Each model was trained for a total of 200 rounds, and 1 round was selected eve rounds for visualization. As can be seen from Figure 11, our method is the fastest to re the best stability in terms of accuracy, recall, and F1 value. In Figure 11a, the accur ranking among the models is ours > TransUNet > Unet > FPN > MSRN > MSFCN. In Fig  11b, overall, the recall rate of our suggested method has been kept as high as possi ensuring that missed diagnoses are avoided to the greatest extent possible. Finally, the score of each model in the training process is compared with our method. From Fig  11c, we can see that our F1 value fluctuates to some extent, but its average value rem the highest. It can be seen that the model is more robust and accurate in processing M images of osteosarcoma with different morphological characteristics.

Discussion
It is shown that the pre-screened TSF shows a robust and accurate classification pability, effectively filters redundant data, and improves segmentation performa while saving the running time of subsequent processing operations. The fast FFT-ba NLM algorithm effectively improves image segmentation and largely reduces the im of noise in osteosarcoma diagnosis.
We designed TBNet, a U-Net network with a fusion transformer and edge enha ment, for the osteosarcoma MRI Image segmentation problem. Several comparison s ies reveal that our model outperforms others in terms of assessment indices such as D and IOU. It offers significant benefits over other classical convolutional models for proper processing of contour features in the complicated tumor border problem. In a tion, comparative tests in terms of accuracy, recall, and F1 by systematic sampling of dom samples show that our model has greater stability. This ensures the stability

Discussion
It is shown that the pre-screened TSF shows a robust and accurate classification capability, effectively filters redundant data, and improves segmentation performance while saving the running time of subsequent processing operations. The fast FFT-based NLM algorithm effectively improves image segmentation and largely reduces the impact of noise in osteosarcoma diagnosis.
We designed TBNet, a U-Net network with a fusion transformer and edge enhancement, for the osteosarcoma MRI Image segmentation problem. Several comparison studies reveal that our model outperforms others in terms of assessment indices such as DSC and IOU. It offers significant benefits over other classical convolutional models for the proper processing of contour features in the complicated tumor border problem. In addition, comparative tests in terms of accuracy, recall, and F1 by systematic sampling of random samples show that our model has greater stability. This ensures the stability and accuracy of segmentation results for practical diagnosis. This is mainly due to the introduction of a cross-fusion channel transformer with edge-enhancement features to replace the original U-Net jump connection in this paper. This approach takes advantage of both a transformer and Unet to perform the cross-fusion of multi-scale channel information, bridge the semantic gap, and solve the problem of semantic hierarchy inconsistency. In addition, the algorithm improves the problem of fuzzy edge segmentation by enhancing the edge information. Therefore, compared with the classical UNet model, the recognition accuracy of our method is higher. To apply to developing countries, our model reduces the computational cost as much as possible while ensuring accuracy and stability.

Conclusions
In this study, an artificial intelligence-aided diagnosis system for osteosarcoma MRI pictures with edge-enhancement characteristics is suggested. The method employs strategies such as image pre-screening, noise-reduction processing, and a segmentation model using edge-enhancement characteristics. It is used mainly to improve the recognition accuracy of osteosarcoma and assist doctors to detect the lesion location of patients more quickly and accurately. Our model is more robust in processing the MRI images of osteosarcoma with different shapes.
However, the recognition accuracy of the TBNet model is relatively low for images without noise reduction and pre-screening. With the development of image processing technology, we will strive to improve the recognition accuracy of tumors in original images. In addition, improving the function of the auxiliary scheme in combination with the actual clinical performance is also the focus of research.  Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: All data analyzed during the current study are included in the submission. Data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after the publication of this article, will be considered by the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.