Classification of Brain Tumors from MRI Images Using a Convolutional Neural Network

Abstract: The classification of brain tumors is performed by biopsy, which is not usually conducted before definitive brain surgery. Improvements in technology and machine learning can help radiologists in tumor diagnostics without invasive measures. A machine-learning algorithm that has achieved substantial results in image segmentation and classification is the convolutional neural network (CNN). We present a new CNN architecture for the classification of three brain tumor types. The developed network is simpler than already-existing pre-trained networks, and it was tested on T1-weighted contrast-enhanced magnetic resonance images. The performance of the network was evaluated using four approaches: combinations of two 10-fold cross-validation methods and two databases. The generalization capability of the network was tested with one of the 10-fold methods, subject-wise cross-validation, and the improvement was tested by using an augmented image database. The best result for the 10-fold cross-validation method was obtained with record-wise cross-validation on the augmented dataset, in which case the accuracy was 96.56%. With good generalization capability and good execution speed, the newly developed CNN architecture could be used as an effective decision-support tool for radiologists in medical diagnostics.


Introduction
Cancer is the second leading cause of death globally, according to the World Health Organization (WHO) [1]. Early detection of cancer can prevent death, but this is not always possible. Unlike cancer, a tumor can be benign, pre-carcinoma, or malignant. Benign tumors differ from malignant ones in that they generally do not spread to other organs and tissues and can be surgically removed [2].
Some of the primary brain tumors are gliomas, meningiomas, and pituitary tumors. Glioma is a general term for tumors that arise from brain tissues other than nerve cells and blood vessels. Meningiomas, on the other hand, arise from the membranes that cover the brain and surround the central nervous system, whereas pituitary tumors are lumps that grow in the pituitary gland, inside the skull [3-6]. The most important difference between these three types of tumors is that meningiomas are typically benign, whereas gliomas are most commonly malignant. Pituitary tumors, even when benign, can cause other medical damage, unlike meningiomas, which are slow-growing tumors [5,6]. For these reasons, the precise differentiation between these three tumor types represents a very important step of the clinical diagnostic process and of the later effective assessment of patients.
The most common method for differential diagnostics of tumor type is magnetic resonance imaging (MRI). However, it is susceptible to human subjectivity, and a large amount of data must be analyzed, which makes automated support valuable. We wanted to examine the network's generalization capability for clinical studies and to show how the subject-wise cross-validation approach gives more realistic results for further implementation.
In this paper, we present a new CNN architecture for the classification of three brain tumor types, meningioma, glioma, and pituitary tumor, from T1-weighted contrast-enhanced magnetic resonance images. The network performance was tested using four approaches: combinations of two 10-fold cross-validation methods (record-wise and subject-wise) and two databases (original and augmented). The results are presented using confusion matrices and the accuracy metric. A comparison with comparable state-of-the-art methods is also presented.

Image Database
The image database used in this paper, provided as a set of slices, contains 3064 T1-weighted contrast-enhanced MRI images acquired at Nanfang Hospital and General Hospital, Tianjin Medical University, China, from 2005 to 2010. It was first published online in 2015, and the last modified version was released in 2017 [22]. There are three types of tumors: meningioma (708 images), glioma (1426 images), and pituitary tumor (930 images). All images were acquired from 233 patients in three planes: sagittal (1025 images), axial (994 images), and coronal (1045 images). Examples of the different tumor types and planes are shown in Figure 1, with the tumors marked by a red outline. The number of images differs from patient to patient.
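The database is publicly available on figshare [22]. Assuming the layout of that release, in which each slice is a MATLAB v7.3 (HDF5) .mat file containing a cjdata struct with image, label, and patient-ID fields, a minimal Python loading sketch could look as follows; the field names and layout are assumptions to verify against the actual download.

```python
# Minimal loading sketch for the figshare release of the database [22].
# Assumes each slice is a MATLAB v7.3 (HDF5) .mat file with a `cjdata`
# struct holding `image` (int16 pixels), `label` (1 = meningioma,
# 2 = glioma, 3 = pituitary tumor), and `PID` (patient ID); this layout
# is an assumption to verify against the actual download.
import glob
import h5py
import numpy as np

def load_slice(path):
    with h5py.File(path, "r") as f:
        cj = f["cjdata"]
        image = np.array(cj["image"], dtype=np.int16)
        label = int(np.array(cj["label"]).item())
        # PID is stored as an array of character codes
        pid = "".join(chr(int(c)) for c in np.array(cj["PID"]).flatten())
    return image, label, pid

paths = sorted(glob.glob("brain_tumor_dataset/*.mat"))
labels = [load_slice(p)[1] for p in paths]
print(np.bincount(labels))  # expect 708 / 1426 / 930 images per class
```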


Image Pre-Processing and Data Augmentation
Magnetic resonance images from the database were of different sizes and were provided in int16 format. These images represent the input layer of the network, so they were normalized and resized to 256 × 256 pixels.
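The paper does not spell out the exact normalization scheme, so the following is only a sketch under the assumption of per-image min-max scaling; the resizing to the 256 × 256 network input is as described above.

```python
# Pre-processing sketch: per-image min-max normalization (an assumption;
# the exact scheme is not specified) and resizing to the 256 x 256 input.
import numpy as np
from skimage.transform import resize

def preprocess(image):
    image = image.astype(np.float32)
    lo, hi = image.min(), image.max()
    image = (image - lo) / (hi - lo + 1e-8)  # scale intensities to [0, 1]
    return resize(image, (256, 256), anti_aliasing=True)
```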
In order to augment the dataset, we transformed each image in two ways. The first transformation was image rotation by 90 degrees, and the second was flipping the image vertically [23]. In this way, we tripled our dataset, resulting in 9192 images, as in the sketch below.
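In NumPy terms, the two transformations amount to the following, mapping every original slice to three images:

```python
# Threefold augmentation as described above: the original slice, its
# 90-degree rotation, and its vertical flip (3064 -> 9192 images).
import numpy as np

def augment(image):
    return [
        image,             # original
        np.rot90(image),   # rotated by 90 degrees
        np.flipud(image),  # flipped vertically
    ]
```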

Network Architecture
Tumor classification was performed using a CNN developed in Matlab R2018a (The MathWorks, Natick, MA, USA). The network architecture consists of the input, two main blocks, the classification block, and the output, as shown in Figure 2. The first main block, Block A, consists of a convolutional layer whose output is two times smaller than its input, followed by a rectified linear unit (ReLU) activation layer and a dropout layer; the block ends with a max pooling layer, which again halves the size of its input. The second block, Block B, differs from the first only in the convolutional layer, which preserves the size of its input. The classification block consists of two fully connected (FC) layers: the first represents the flattened output of the last max pooling layer, whereas the number of hidden units in the second equals the number of tumor classes, with output "1" for meningioma, "2" for glioma, and "3" for pituitary tumor. The whole network consists of the input layer, two Blocks A, two Blocks B, the classification block, and the output layer; altogether, there are 22 layers, as shown in Table 1.
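A hedged PyTorch sketch of this topology is given below (the original implementation is in Matlab). The kernel sizes, channel counts, and dropout rate are illustrative assumptions; the exact 22-layer configuration is the one in Table 1.

```python
# PyTorch sketch of the described topology: two Blocks A (stride-2
# convolution halves the spatial size, then ReLU, dropout, 2 x 2 max
# pooling) followed by two Blocks B (size-preserving convolution, ReLU,
# dropout, max pooling) and the classification block.
import torch
import torch.nn as nn

def block_a(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # halves H, W
        nn.ReLU(),
        nn.Dropout2d(0.25),   # dropout rate is an illustrative assumption
        nn.MaxPool2d(2),      # halves H, W again
    )

def block_b(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),  # preserves H, W
        nn.ReLU(),
        nn.Dropout2d(0.25),
        nn.MaxPool2d(2),
    )

model = nn.Sequential(
    block_a(1, 16),            # 256 -> 64
    block_a(16, 32),           # 64 -> 16
    block_b(32, 64),           # 16 -> 8
    block_b(64, 64),           # 8 -> 4
    nn.Flatten(),              # first FC layer: flattened pooling output
    nn.Linear(64 * 4 * 4, 3),  # second FC layer: three tumor classes
)
```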

Training Network
We used a k-fold cross-validation method to test the network performance [24]. Two different approaches were implemented, both consisting of 10-fold cross-validation. The first approach was to randomly divide the data into 10 approximately equal portions so that each tumor category was equally present in each portion, referred to as record-wise cross-validation. The second approach was to randomly divide the data into 10 approximately equal portions such that the data from a single subject could only be found in one of the sets; each set therefore contained data from several subjects regardless of the tumor class, referred to as subject-wise cross-validation. The second approach was implemented to test the generalization capability of the network in medical diagnostics [25]. The generalization capability in clinical practice represents the ability to predict the diagnosis based on data obtained from subjects for which there are no observations in the training process. Therefore, observations from individuals in the training set must not appear in the test set. If this is not the case, complex predictors can pick up a confounding relationship between identity and diagnostic status and so produce unrealistically high prediction accuracy [26]. In order to compare the performance of our network with other state-of-the-art methods, we also tested our network without k-fold cross-validation (a single test). In all the above-mentioned methods, two data portions were used for testing, two for validation, and six for training. Both datasets, original and augmented, were tested using all the methods.
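Both splitting schemes map directly onto scikit-learn utilities. The following sketch uses dummy labels and patient IDs to illustrate the difference: record-wise folds are stratified by tumor class, while subject-wise folds keep all images of one patient in a single fold.

```python
# Record-wise vs. subject-wise 10-fold splitting with scikit-learn,
# illustrated on dummy labels and patient IDs.
import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold

rng = np.random.default_rng(0)
labels = rng.integers(1, 4, size=3064)         # dummy tumor classes 1..3
patient_ids = rng.integers(0, 233, size=3064)  # dummy patient IDs
X = np.zeros((3064, 1))                        # placeholder features

# Record-wise: every class is (approximately) equally present per fold.
for train_idx, test_idx in StratifiedKFold(
        n_splits=10, shuffle=True, random_state=0).split(X, labels):
    pass  # two folds for test, two for validation, six for training

# Subject-wise: all images of a patient land in exactly one fold.
for train_idx, test_idx in GroupKFold(n_splits=10).split(
        X, labels, groups=patient_ids):
    pass
```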
The network was trained using the Adam optimizer, with a mini-batch size of 16 and data shuffling in every iteration. The early-stop condition was evaluated once per epoch, so training finishes at the end of an epoch once the validation loss starts to increase. The regularization factor was set to 0.004, and the initial learning rate to 0.0004. The weights of the convolutional layers were initialized using the Glorot initializer, also known as the Xavier initializer [27].
The training process was stopped when the loss on the validation set was greater than or equal to the previous lowest loss 11 times. The network was trained and tested on a single graphics processing unit (GPU), a CUDA device, the GeForce GTX 1050 Ti.
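A rough PyTorch equivalent of this training configuration is sketched below; note that weight_decay only approximates Matlab's L2 regularization factor, and `model` refers to the architecture sketch above.

```python
# Training-setup sketch approximating the reported configuration: Adam,
# learning rate 0.0004, L2 regularization 0.004 (approximated via
# weight_decay), Glorot/Xavier initialization, and early stopping once
# the validation loss has failed to beat the lowest value 11 times.
# Mini-batches of 16 with shuffling would come from a DataLoader, e.g.
# DataLoader(train_set, batch_size=16, shuffle=True).
import torch
import torch.nn as nn

for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_uniform_(m.weight)  # Glorot/Xavier initializer
        nn.init.zeros_(m.bias)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0004, weight_decay=0.004)
criterion = nn.CrossEntropyLoss()

best_loss, bad_epochs, PATIENCE = float("inf"), 0, 11

def should_stop(val_loss):
    """Stop after 11 validation losses >= the previous lowest loss."""
    global best_loss, bad_epochs
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    return bad_epochs >= PATIENCE
```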

Results and Discussion
Results of the developed CNN are shown in Table 2 and visualized using the confusion matrices shown in Figures 3 and 5-7. In the confusion matrices, non-white rows represent network output classes, and non-white columns correspond to the real classes. The numbers/percentages of correctly classified images are shown on the diagonal. The last row represents the sensitivity, whereas the last column corresponds to the specificity. The overall accuracy is shown in the bottom-right field. The upper number in the non-white boxes corresponds to the number of images, and the lower number represents the percentage of the whole class database in the training or test set. To account for the imbalance of tumor classes in the database, we also report the mean average precision, recall, and F1-score in Table 2.

Figure 3 shows the confusion matrices for the record-wise 10-fold cross-validation approach for testing data obtained from the original dataset. The classification error for the testing set after cross-validation is 4.6%. Examples of classified images from the original dataset with record-wise 10-fold cross-validation are shown in Figure 4, with the tumors outlined in red.
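For reference, the reported quantities can be computed from pooled test labels and predictions with scikit-learn; `y_true` and `y_pred` below are dummy stand-ins, not our data.

```python
# Computing accuracy, the confusion matrix, and macro-averaged
# precision, recall, and F1-score (macro-averaging counters the class
# imbalance noted above).
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = np.array([1, 1, 2, 2, 3, 3])  # dummy labels (1=meningioma, 2=glioma, 3=pituitary)
y_pred = np.array([1, 2, 2, 2, 3, 3])  # dummy network predictions

print(confusion_matrix(y_true, y_pred))
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")
accuracy = float((y_true == y_pred).mean())
print(accuracy, precision, recall, f1)
```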

Confusion matrices for the record-wise 10-fold cross-validation method for testing data from the augmented dataset are shown in Figure 5. The classification error for the testing data is 3.4%.

Confusion matrices for the subject-wise 10-fold cross-validation approach for testing data from the original dataset are shown in Figure 6. This result shows that the network achieves a smaller accuracy than for record-wise cross-validation, which was expected because the predictions were made on previously unseen data.



Confusion matrices for the subject-wise 10-fold cross-validation approach for testing data from the augmented dataset are shown in Figure 7. The classification error for the testing data is 11.5%.
The proposed CNN architecture has only 4.3 million weights, and it obtained better results with augmented data, which was expected because the dataset is not especially extensive. Even with the augmented dataset, the subject-wise accuracy is lower than the accuracy obtained with the record-wise cross-validation because the augmentation only increased the number of images per patient, not the number of patients; for the subject-wise split, increasing the number of patients would have mattered more. The first class of tumors, meningioma, had the lowest sensitivity and specificity for all four testing methods. This is easily explained given that meningioma is the hardest to discern from the other two types of tumors based on the place of origin and overall features. The execution speed was quite good, with an average of less than 15 ms per image.


Comparison with State-of-the-Art Methods
There are several papers that use the same database for brain tumor classification. In order to compare our results with those of previous studies, we selected only those papers which designed their own neural networks, used whole images as input for classification, and tested their networks with a k-fold cross-validation method, as shown in Table 3. We also compared our results with those of researchers who did not test their networks with k-fold cross-validation, as shown in Table 4. A comparison with the studies that used designed neural networks and an augmented dataset, but did not test by k-fold cross-validation, is presented in Table 5.

In the literature, there are also studies that used the same database for classification with pre-trained networks [23,35-40] or that used as input only the tumor region or features extracted from it [7,21,23,41,42]. Similarly, in several papers, researchers modified this database prior to classification [36,43-47]. Designed networks are usually simpler than already-existing pre-trained networks and have faster execution speed. To our knowledge, the best results using pre-trained networks are 98.69% [40] and 98.66% accuracy [36]. Rehman et al. [40] preprocessed the images with contrast enhancement and augmented the dataset fivefold, with rotations of 90, 180, and 270 degrees and with horizontal and vertical flipping. Their best result was obtained with a fine-tuned VGG16 trained using stochastic gradient descent with momentum. Although our approach has a 1.41% higher classification error, it has 4.3 million weights as opposed to VGG16, which is a very deep network with 138 million weights. Very deep networks such as VGG16 and AlexNet require dedicated hardware for real-time performance. Kutlu and Avcı [36] also modified the database, using only 100 images of each tumor type, all taken in the axial plane. For feature extraction, they used the pre-trained AlexNet, and, for testing, they performed 5-fold cross-validation. It is unclear how their algorithm would perform on the whole dataset and what its generalization capabilities are.
Developing a network that uses only the tumor region or some other segmented part as input is better in terms of execution speed, but it also requires segmentation methods or a dedicated expert to mark those parts.
To our knowledge, the best result in the literature using segmented image parts as input is presented by Tripathi and Bag [41], with 94.64% accuracy. As input to the classifiers, they used features extracted from the brain segmented from the image. They tested their approach using 5-fold cross-validation.

Conclusions
A new CNN architecture for brain tumor classification was presented in this study. The classification was performed on a T1-weighted contrast-enhanced MRI image database containing three tumor types. As input, we used whole images, so it was not necessary to perform any preprocessing or segmentation of the tumors. Our designed neural network is simpler than pre-trained networks and can be run on conventional modern personal computers, because the algorithm requires far fewer resources for both training and implementation. The importance of developing smaller networks is also linked to the possibility of deploying the algorithm on mobile platforms, which is significant for diagnostics in developing countries [48]. In addition, the network has a very good execution speed of 15 ms per image. In order to test the network, we used record-wise and subject-wise 10-fold cross-validation on both the original and the augmented image database. In clinical diagnostics, the generalization capability implies making predictions for subjects from whom we have no observations; with this in mind, the observations from individuals in the training set must not appear in the test set. If this condition is not met, complex predictors can achieve unrealistically high prediction accuracy due to the confounding dependency between the identity and the diagnosis of a patient [26]. For this reason, we performed subject-wise cross-validation.
A comparison with comparable state-of-the-art methods shows that our network obtained better results. The best result for 10-fold cross-validation was achieved with the record-wise method on the augmented dataset, with an accuracy of 96.56%. To our knowledge, there is no paper in the literature that tests generalization on this image database through the subject-wise k-fold method. For the subject-wise approach, we obtained an accuracy of 88.48% on the augmented dataset. The average test execution was less than 15 ms per image. These results show that our network has good generalization capability and good execution speed, so it could be used as an effective decision-support tool for radiologists in medical diagnostics.
Regarding further work, we will consider other approaches to database augmentation (e.g., increasing the number of subjects) in order to improve the generalization capability of the network. One of the main improvements will be adjusting the architecture so that it can be used during brain surgery for classifying and accurately locating the tumor [49]. Detecting tumors in the operating room must be performed in real time and under real-world conditions; in that case, the improvement would also involve adapting the network to a 3D system [50]. By keeping the network architecture simple, detection in real time could remain possible. In the future, we will examine the performance of our designed neural network, as well as improved versions of it, on other medical images.