Towards Robust and Accurate Detection of Abnormalities in Musculoskeletal Radiographs with a Multi-Network Model

This study proposes a novel multi-network architecture, named MSCNN-GCN, which combines a multi-scale convolutional neural network (MSCNN) with a fully connected graph convolutional network (GCN) for the detection of musculoskeletal abnormalities in musculoskeletal radiographs. To obtain both detailed and contextual information for a better description of the characteristics of the radiographs, the designed MSCNN contains three subnetwork sequences (three different scales). It maintains a high resolution in each subnetwork while fusing features with different resolutions. A GCN structure was employed to capture the global structure information of the images. Furthermore, the outputs of the MSCNN and the GCN were fused by concatenating their two feature vectors, thus making the novel framework more discriminative. The effectiveness of this model was verified by comparing its performance with that of radiologists and three popular CNN models (DenseNet169, CapsNet, and MSCNN) using three evaluation metrics (accuracy, F1 score, and Cohen-kappa score) on the MURA dataset (a large dataset of bone X-rays). Experimental results showed that the proposed framework not only reached the highest accuracy, but also achieved top scores on both the F1 and kappa metrics. This indicates that the proposed model achieves high accuracy and strong robustness on musculoskeletal radiographs, which shows strong potential as a feasible scheme for intelligent medical applications.


Introduction
Musculoskeletal diseases are common in medicine and severely affect the health and daily life of more than 1.7 billion people worldwide [1]. Musculoskeletal diseases are often accompanied by pain in muscles, bones, or joints and may be a precursor of more severe diseases [2]. Typically, doctors require information about patients' symptoms to determine whether further examinations are necessary. Medical images play an important role in the diagnosis of musculoskeletal diseases and are currently analyzed by doctors to determine abnormalities in patients. However, this procedure may be compromised by the heavy workload imposed on doctors every day. With the accelerated digitization of modern hospitals, computer-aided diagnostic systems based on medical images provide an effective means to assist doctors in obtaining a more objective judgement and to decrease their burden. Digital image technology is increasingly widespread in the field of medical imagery. Among recent techniques, graph convolution methods have achieved state-of-the-art performance on graph-like data sets [27]. However, their training and inference procedure is time-consuming. The authors of Reference [28] proposed improvements based on the work of Reference [27] and accelerated graph convolution by a factor of eight. However, this is still seven times slower than a classic CNN, which limits the size of the graph data. Therefore, graph convolution is typically conducted on small graph data for quick convergence [27,28]. This makes the method incapable of representing significant details in images, which may affect the overall accuracy of the model.
This paper presents an effective approach for the detection of musculoskeletal abnormalities with radiographs based on a novel deep learning framework. This framework consists of a multi-scale two-dimensional (2D) CNN, a fully connected graph convolution neural network, and a fusion module with the following main contributions:

1. A preprocessing scheme for radiographs is proposed to create an identity map from the original image to the expected input image, utilizing an image padding method [29] to pad the original image to square proportions and then zooming it to the appropriate size;
2. The network structure of the CNN is deeply analyzed, and a multi-scale network structure with powerful discriminating ability and high-resolution feature maps is proposed. Three subnetwork sequences with different resolutions are adopted, and each sequence is connected to all other sequences through upsampling or downsampling to perform salient feature fusion;
3. A graph convolutional neural network is employed to extract the global structure information and context information of the radiographs, utilizing the embedding method [28] to abstract the image into graph data. Graph convolution is then conducted on these data to extract the structural features and the context relationships hidden in the graph data;
4. The high accuracy and strong robustness of the proposed framework are demonstrated. The structure combines the two network streams via concatenation of their flattened layers to perform structural and salient feature fusion. It can maintain high-resolution representations while obtaining effective representations of the structural features.
The remainder of this paper consists of four parts: Section 2 describes the related literature, Section 3 illustrates the proposed method, Section 4 presents the experimental results and discussion, and Section 5 concludes the paper.

Related Works
Generally, medical image processing is an important problem of the application of computer vision in the medical field. Machine learning, especially deep learning, has played an important role in medical image representation and classification. To improve model performance, various network structures are proposed. Here, examples are chosen that are most relevant to introduce this work.

High Resolution Neural Network (HRNet)
HRNet is a CNN with the ability to maintain high-resolution information over the whole course [23]. The entire network block is decomposed into several subnetworks. Let N_{s,r} be a subnetwork, where s represents the current depth and r represents the resolution. The subnetwork sequence in the first resolution can be defined as:

N_{1,1} → N_{2,1} → · · · → N_{n,1},

where n represents the length of the network sequence. Lower-resolution subnetworks are added gradually and in parallel to extend the full network along the scale axis, which can be defined as:

N_{1,1} → N_{2,1} → N_{3,1} → · · ·
        ↘ N_{2,2} → N_{3,2} → · · ·
                  ↘ N_{3,3} → · · ·

Repeated multi-scale fusions are performed through upsampling and downsampling.

Graph Convolutional Network (GCN)
Inspired by the significant success of CNNs in computer vision, many recent studies have redefined the concept of convolution for graph data. These methods belong to the category of GCNs [25]. The authors of Reference [27] presented the first prominent study on GCNs and developed a variant of graph convolution based on spectral graph theory. This method transfers both the filter and the graph signal of the convolution network to the Fourier domain for processing. However, many parameters need to be adjusted when training graph convolution methods based on spectral graph theory. The authors of Reference [28] proposed a fast localized spectral filtering method to perform convolution on graphs.

Proposed Method
The proposed abnormality detection method consists of four main components: a radiograph preprocessing method to generate the proper input, a 2D CNN that repeatedly fuses multi-scale salient features, a fully connected GCN that extracts structural features from the downsampled data, and a fusion module that concatenates the flattened layers of the two network streams. The main contributions of this work are described in detail in the following.

Method of Radiographs Preprocessing
In general, radiographs have variable sizes due to differences in equipment and the acquisition environment [5]. Therefore, the images need to be preprocessed to meet the input requirements of the proposed network. Improper image preprocessing may affect model performance; therefore, an effective data preprocessing method is very important for the task of abnormality detection in musculoskeletal radiographs. The aspect ratio of the radiographs reflects the proportions of body tissues and usually contains useful information. To maintain the aspect ratio information without changing the data distribution of the image, a feasible preprocessing method is proposed, which contains four main steps. The original image is defined as a matrix I_{W×H}, where W represents the vertical dimension of the matrix and H represents the horizontal dimension. With L = max(W, H), a padding transformation is defined to transform the image:

I'_{L×L}(i, j) = I(i, j) if i ≤ W and j ≤ H, and 0 otherwise,

which transforms the original image into a square with L as the edge length, with the original content aligned to the top-left corner and the remaining entries set to zero. Then, a shrink function is defined as follows:

I'' = R(I'),

where R(·) represents the resize function that shrinks the image to the appropriate size. The method is summarized by the following equivalent pseudocode:

1. Calculate the maximum of the width and height as L.
2. Create a new square image with L as the edge length and 0 as each pixel value.
3. Align the original image with the top-left corner of the newly created image and merge both.
4. Shrink the merged image to the expected size.
An abnormal case was chosen to intuitively demonstrate the algorithm, as shown in Figure 1. During the whole process, the original aspect ratio of 512:413 is maintained.
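The four steps above can be sketched in a few lines of NumPy. The target input size and the nearest-neighbour shrink used here are illustrative assumptions, since the resize function R(·) is not specified in the text:

```python
import numpy as np

def preprocess(image: np.ndarray, target: int = 224) -> np.ndarray:
    """Pad a grayscale radiograph to a square with zeros, then shrink it.

    Follows the four-step scheme above; `target` is an assumed network
    input size, and nearest-neighbour indexing stands in for R(.).
    """
    w, h = image.shape                            # vertical (W), horizontal (H)
    L = max(w, h)                                 # 1. edge length of the square
    square = np.zeros((L, L), dtype=image.dtype)  # 2. L x L zero image
    square[:w, :h] = image                        # 3. merge at the top-left corner
    # 4. shrink to the expected size (nearest-neighbour indexing)
    rows = np.arange(target) * L // target
    cols = np.arange(target) * L // target
    return square[np.ix_(rows, cols)]
```

Because both axes are padded to the same edge length before resizing, the aspect ratio of the body tissue is preserved, as intended by the scheme.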

Proposed Multi-Scale Convolution Neural Network (MSCNN)
Part I is the MSCNN, which can be described as follows. The MSCNN contains three subnetwork sequences. N_{r,d} is used to represent each subnetwork, where r represents the resolution of the current subnetwork and d represents its depth. Subnetworks with the same resolution form a network sequence. Subnetworks with different resolutions are connected by upsampling or downsampling. The convolution units of the three subnetwork sequences contain 64, 128, and 256 convolution kernels, respectively. Each kernel is a 3 × 3 filter with a stride of 1 and a padding of 1. The sizes of the corresponding feature maps are W × H × 64, W/2 × H/2 × 128, and W/4 × H/4 × 256, where W represents the width of the input image and H its height. The inputs of the fusion unit are defined as s feature maps X_1, X_2, . . . , X_s, and the outputs are s response maps Y_1, Y_2, . . . , Y_s. The multi-scale fusion process can be defined as:

Y_k = Σ_{i=1}^{s} a(X_i, k), k = 1, . . . , s,

where the function a(·) denotes downsampling or upsampling of X_i from the ith resolution to the kth resolution (the identity when i = k). The downsampling unit is a 2-strided or 4-strided 3 × 3 convolution kernel with a padding of 1, according to the target resolution. The upsampling unit contains two parts: first, simple nearest-neighbor sampling is adopted to increase the resolution of the feature maps by a factor of two or four, depending on the target resolution; then, a 1 × 1 convolution is performed to align the number of channels between subnetworks with different resolutions. The output of the MSCNN is flattened into a feature vector of size 1 × 1 × 512 as input into the fusion module.
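The fusion rule Y_k = Σ_i a(X_i, k) can be illustrated with a minimal NumPy sketch. The learned strided convolutions and the channel-aligning 1 × 1 convolutions are replaced by plain nearest-neighbour resampling, so this is a shape-level illustration of the fusion pattern rather than the trained network:

```python
import numpy as np

def resample(x: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour resampling of an (H, W) feature map.

    factor > 1 upsamples, factor < -1 downsamples; the learned
    convolutions of the actual network are omitted for brevity.
    """
    if factor >= 1:
        return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)
    step = -factor
    return x[::step, ::step]

def fuse(xs: list, k: int) -> np.ndarray:
    """Multi-scale fusion Y_k = sum_i a(X_i, k), where each resolution
    level halves the spatial size of the previous one."""
    out = np.zeros_like(xs[k])
    for i, x in enumerate(xs):
        if i == k:
            out = out + x                         # identity at the same level
        elif i < k:                               # higher resolution -> downsample
            out = out + resample(x, -(2 ** (k - i)))
        else:                                     # lower resolution -> upsample
            out = out + resample(x, 2 ** (i - k))
    return out
```

Running `fuse` over three maps of sizes 4 × 4, 2 × 2, and 1 × 1 produces outputs at each target resolution, mirroring the three subnetwork sequences of the MSCNN.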

Proposed Graph Convolution Network (GCN)
The second part is the GCN, which can be described as follows. Nearest-neighbor interpolation is adopted to downsample the original image to the size of W/8 × H/8, where W represents the width of the input image and H its height. The downsampled image is then converted to graph data g = (V, ε, w) as input, where V represents the vertices of the graph, ε represents the set of edges, and w represents a weighted adjacency matrix that describes the connection weight between each pair of vertices, with all values in the matrix set to 1. The GCN stream contains three main modules: a graph convolution module, a graph coarsening module, and a graph pooling module. A stack of these three modules constitutes the network flow. The convolution operation on the graph is defined as:

y = U((Uᵀg) ⊙ (Uᵀx)),

where ⊙ represents the element-wise Hadamard product, U represents the matrix of eigenvectors of the graph Laplacian, g represents the filter, x represents the feature of the whole graph, and y represents the output of the convolution. Global average pooling is used as the pooling operation on the graph. The GCN block contains six convolution units. The output of the GCN is then flattened into a feature vector of size 1 × 1 × 512 as input into the fusion module.
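The spectral convolution above can be sketched as follows. The unnormalized graph Laplacian and the explicit eigendecomposition are assumptions for illustration; the fast localized filtering of Reference [28] avoids this decomposition by using polynomial approximations:

```python
import numpy as np

def spectral_conv(adj: np.ndarray, x: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Classical spectral graph convolution y = U((U^T g) * (U^T x)).

    adj is the weighted adjacency matrix w (all ones here, per the text),
    x the graph signal, g a spatial filter; U holds the eigenvectors of
    the unnormalised graph Laplacian, i.e. the graph Fourier basis.
    """
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    _, U = np.linalg.eigh(laplacian)        # Fourier basis of the graph
    return U @ ((U.T @ g) * (U.T @ x))      # Hadamard product in the Fourier domain
```

Because the operation is linear in the signal x, doubling x doubles the output, which is a quick sanity check on any implementation.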

Proposed Fusion Module
The output of the multi-scale fusion convolution network is concatenated with the output of the GCN to generate the final output. Here, V_MSCNN represents the feature vector generated by the MSCNN and V_GCN the feature vector generated by the GCN. The concatenation function is defined as V_OUT = C(V_MSCNN, V_GCN), where V_OUT represents the feature vector after both vectors have been joined in sequential order. V_OUT is then passed into a fully connected layer to generate a two-dimensional logit vector z. This procedure can be seen in Figure 2. The final output is generated by the softmax function, which is defined as:

P_c(x) = e^{z_x} / Σ_j e^{z_j},

where x represents the class and P_c(x) represents the probability of the output being class x.
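A minimal sketch of this fusion head follows; the weight matrix W and bias b stand in for the hypothetical trained parameters of the final fully connected layer:

```python
import numpy as np

def fusion_head(v_mscnn: np.ndarray, v_gcn: np.ndarray,
                W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Concatenate the two 1x1x512 stream outputs, apply a fully
    connected layer, and return class probabilities via softmax.

    W (2 x 1024) and b (2,) are hypothetical trained parameters.
    """
    v_out = np.concatenate([v_mscnn, v_gcn])   # sequential join, length 1024
    z = W @ v_out + b                          # two-dimensional logit vector
    z = z - z.max()                            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()                         # P_c(x) = e^{z_x} / sum_j e^{z_j}
```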

Proposed Framework
The whole network architecture and the overall pipeline of the proposed MSCNN-GCN framework are shown in Figure 3. The network is trained with the batch training method. The loss function used in each network branch is the cross entropy, which can be defined as:

Loss(θ) = −(1/B) Σ_{i=1}^{B} log Y(M_i; θ)_{c_i},

where B represents the batch size used for training, Y(M_i; θ) represents the output of the network for the current batch, (M_i, c_i) represents a pair of input data and its label in the current batch, and θ represents the parameters of the network that need to be adjusted. The total loss of the entire network is a weighted sum of the MSCNN and GCN losses, and the default value of each weight is set to 0.5. The stochastic gradient descent (SGD) method is used as the optimizer. The proposed network is trained in an end-to-end manner.
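The weighted two-branch loss can be sketched as follows, assuming each branch already outputs per-class probabilities for a batch:

```python
import numpy as np

def total_loss(p_mscnn: np.ndarray, p_gcn: np.ndarray,
               labels: np.ndarray, w: float = 0.5) -> float:
    """Weighted sum of the cross-entropy losses of the two branches.

    p_* are (B, 2) batches of predicted class probabilities, labels is a
    length-B vector of class indices, and w is the default branch weight
    of 0.5 described in the text.
    """
    B = len(labels)
    ce_mscnn = -np.log(p_mscnn[np.arange(B), labels]).mean()
    ce_gcn = -np.log(p_gcn[np.arange(B), labels]).mean()
    return w * ce_mscnn + (1.0 - w) * ce_gcn
```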

MURA Dataset
The MURA dataset (MURA) is a large and representative dataset of musculoskeletal radiographs, collected by the Stanford ML Group with the aim of enabling significant advances in medical imaging technologies that can diagnose at the level of experts [30]. MURA contains 40,895 multi-view radiographic images of seven body parts (elbow, finger, hand, humerus, forearm, shoulder, and wrist) from 14,656 studies (12,173 patients) [21]. Each study was labeled as either normal or abnormal by radiologists [21]. As shown in Table 1, which presents the details of MURA, the dataset is separated into two parts: a training set (TS) and a validation set (VS). The TS contains 13,457 studies with 8280 normal cases and 5177 abnormal cases. The VS contains 1199 studies with 661 normal cases and 538 abnormal cases.

Evaluation Metrics
F1 score, accuracy, balanced accuracy, and Cohen-kappa score were used as metrics in this task. The F1 score is an indicator used to measure the accuracy of a dichotomous model in statistics and can be considered as a harmonic average of model precision and recall, with a maximum value of 1 and a minimum value of 0 [29]. The accuracy metric directly reflects the performance of the model, while the Cohen-kappa score is a more robust metric that measures inter-rater agreement for qualitative or categorical items [31]. The F1 score is defined as:

F1 = 2PR / (P + R),

where P represents precision and R represents recall, which are defined as:

P = TP / (TP + FP), R = TP / (TP + FN).

Accuracy is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP represents true positives, TN true negatives, FP false positives, and FN false negatives. Balanced accuracy is defined as:

Balanced accuracy = (TPR + TNR) / 2,

where TPR represents the true positive rate and TNR represents the true negative rate. The Cohen-kappa score [31] is defined as:

κ = (P_O − P_e) / (1 − P_e),

where P_O is equal to the accuracy defined above and P_e represents the hypothetical probability of chance agreement. Specifically, suppose the true sample number of each class is (a_1, a_2, . . . , a_c) and the number of samples predicted for each category is (b_1, b_2, . . . , b_c), where c represents the total number of classes and n the total number of samples. Then P_e can be described as:

P_e = (1/n²) Σ_{i=1}^{c} a_i b_i.
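These metric definitions can be computed directly from the binary confusion-matrix counts:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute F1, accuracy, balanced accuracy, and Cohen-kappa from a
    binary confusion matrix, following the definitions above."""
    p = tp / (tp + fp)                      # precision P
    r = tp / (tp + fn)                      # recall R
    f1 = 2 * p * r / (p + r)
    n = tp + tn + fp + fn
    acc = (tp + tn) / n                     # also P_O for kappa
    tpr = tp / (tp + fn)                    # true positive rate
    tnr = tn / (tn + fp)                    # true negative rate
    bal_acc = (tpr + tnr) / 2
    # chance agreement P_e = (1/n^2) * sum_i a_i * b_i
    a_pos, a_neg = tp + fn, tn + fp         # true samples per class
    b_pos, b_neg = tp + fp, tn + fn         # predicted samples per class
    p_e = (a_pos * b_pos + a_neg * b_neg) / n ** 2
    kappa = (acc - p_e) / (1 - p_e)
    return {"f1": f1, "accuracy": acc,
            "balanced_accuracy": bal_acc, "kappa": kappa}
```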

Results and Discussion
In this study, two sets of experiments, experiment A and experiment B, were conducted to evaluate the performance of the proposed framework. The same framework structure was applied to each category in the MURA dataset. All experiments were conducted on the MURA dataset and performed on four Nvidia RTX 2080 Ti GPUs and an Intel Xeon E5-2600 v4 3.60 GHz CPU using the PyTorch framework.

Experiment A: F1 score (MSCNN-GCN, DenseNet169, Radiologists)
DenseNet169 is a 169-layer densely connected convolutional network that was developed by the Stanford ML Group to detect abnormalities in musculoskeletal radiographs in the MURA dataset [21]. In this experiment, DenseNet169, radiologists, and the MSCNN-GCN were compared with regard to the F1 score on MURA. The TS was split into 10 folds for each type of musculoskeletal radiograph based on a stratified sampling method to train the developed model. A 10-fold cross-validation approach was adopted to evaluate the performance of the trained model. The samples in the VS were employed to verify the performance of the proposed framework.
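The stratified split can be sketched as follows. This minimal version deals sample indices round-robin per class so that each fold preserves the class ratio; a practical split would also shuffle within each class first:

```python
from collections import defaultdict

def stratified_folds(labels, k: int = 10):
    """Split sample indices into k folds while preserving the per-class
    ratio, as in the stratified sampling scheme above.

    Returns a list of k index lists; shuffling is omitted for brevity.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):   # deal samples round-robin
            folds[pos % k].append(idx)
    return folds
```

Each fold then serves once as the held-out set while the remaining k − 1 folds are used for training, yielding the 10-fold cross-validation estimate.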
As shown in Table 2, radiologists [21], DenseNet169 [21], and the proposed model were compared with regard to the F1 metric. The proposed model outperformed not only DenseNet169 but also the radiologists, as can be seen from Figure 4.

Table 2. Accuracy and balanced accuracy obtained by the proposed framework. [Table columns: Image; Train-Validation Accuracy; Validation Accuracy; Validation Balanced Accuracy.]

The feature vectors obtained by graph convolution and those obtained by the high-resolution multi-branch CNN were combined. In this manner, the global structural features and the local significant features of the image were well fused to obtain a more sufficient and effective representation of the image. Experimental results showed that this model achieved an accurate and robust diagnosis for abnormality detection of musculoskeletal diseases with an overall score of 90.9%, which is 2.5% higher than that achieved by radiologists and 5% higher than that of the DenseNet169 model. Besides, as shown in Figure 5, two shoulder samples (one abnormal and one normal) were randomly selected from the VS, and the salient features that contributed most to the prediction output of the MSCNN-GCN for a given target category were visualized as heatmaps. Figure 5a shows the abnormal sample, which was predicted as abnormal by the MSCNN-GCN, while Figure 5b shows the normal sample, which was predicted as abnormal by the MSCNN-GCN. From this, it can be observed that the MSCNN-GCN is sensitive to the joints of bones. The accuracy metrics of the proposed framework are listed in Table 3. Train-validation accuracy is the average accuracy of the 10-fold cross-validation results, while validation accuracy is the result on the VS. CapsNet was shown to be capable of detecting abnormalities in musculoskeletal radiographs with good accuracy on the MURA dataset by the authors of Reference [32]. To ensure fairness in the comparison experiments, the same dataset split was used for both TS and VS as that applied in Reference [32]. The Cohen-kappa scores are listed in Table 4. The figures in brackets are kappa scores for the two categories (normal and abnormal), and those outside the brackets are the means of the two kappa scores.
As shown in Figure 6, these results were compared with those of two models (DenseNet169 [21], CapsNet [32]) that have been published for each type in the MURA dataset. Experiments were also conducted to compare the performance of the single MSCNN and the MSCNN-GCN to demonstrate the benefits of combining the GCN with the MSCNN. As shown in Figure 7, the performance of the MSCNN-GCN was higher than that of the MSCNN because of the contributions of the GCN. The MSCNN branch of the proposed model can obtain local feature representations of an image with different sizes of receptive fields and fuses multi-scale features via skip-connections, thus improving feature reuse. Different sizes of receptive fields are activated at the same resolution in the proposed MSCNN branch, so that more abundant local features are obtained. In this manner, the detailed information of the radiographs is better preserved. The GCN branch of the proposed framework abstracts images into graph data and extracts graph features at low resolution; thus, the relationships between faraway pixel nodes can be identified. The overall structural features of the image are also well described.

Conclusions
This study presents a multi-network framework (MSCNN-GCN), which consists of a multi-branch 2D CNN and a fully connected GCN, for the automatic detection of abnormalities in musculoskeletal radiographs. The performance of the developed framework surpasses that of current state-of-the-art models on MURA with respect to three evaluation metrics (accuracy, F1 score, and kappa score). The abnormality detection level of the proposed model is also no worse than that of radiologists. The benefits of using multi-scale salient features in 2D CNNs were analyzed, which allows the combination of both detailed and contextual information to better describe the characteristics of radiographs. The advantages of using global structure information in GCNs were discussed, which enables the capture of additional feature information of nodes and of the connection relationships between nodes. Furthermore, an efficient feature-fusion architecture was proposed for the processing of bone radiographs, which transforms different types of features into a uniform feature space and uses a concatenation operation to complete the feature fusion. In the future, the potential of this method will be explored on more challenging tasks (such as lesion localization or segmentation) for other diseases, such as pulmonary nodules, arteriosclerosis, and lymph node abnormalities on CT images, to expand the application of the proposed framework. We optimistically expect that the framework has promising potential for the application of deep learning methods in the field of intelligent medicine.
Author Contributions: All authors contributed extensively to the study presented in this manuscript. S.L. and Y.G. contributed significantly to the conception of the study. S.L. designed the network and conducted the experiments. S.L. and Y.G. provided, marked, and analyzed the experimental results. Y.G. supervised the work and contributed with valuable discussions and scientific advice. All authors contributed to writing this manuscript. All authors have read and agreed to the published version of the manuscript.