Assembling Deep Neural Networks for Medical Compound Figure Detection

Abstract: Compound figure detection on figures and associated captions is the first step to making medical figures from biomedical literature available for further analysis. The performance of traditional methods is limited by the choice of hand-engineered features and prior domain knowledge. We train multiple convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and gated recurrent unit (GRU) networks on top of pre-trained word vectors to learn textual features from captions, and employ deep CNNs to learn visual features from figures. We then identify compound figures by combining textual and visual predictions. Our proposed architecture obtains remarkable performance in three run types (textual, visual and mixed) and achieves better performance than the best previous results in ImageCLEF2015 and ImageCLEF2016.


Introduction
With the development of the Internet, the amount of biomedical literature in electronic format has increased considerably [1]. The large number of figures in these articles has become valuable to medical education, medical research and clinical diagnosis [2]. It is difficult to manage such a substantial repository of information appropriately. This prompts the development of methods for the automatic classification of medical figures in order to improve the retrieval of relevant images.
Many images in the biomedical literature (over 40% [3]) are compound figures (see Figure 1). Medical image retrieval systems, such as OPENi [4], have been cross-media-based, relying on the captions associated with the images as the input to the retrieval system. Users of OPENi can filter out compound figures by setting "Image Type" to "Exclude Multipanel" to narrow the search. Compound figure identification is the first step to making compound images from the literature available for further analysis, such as multi-label classification, compound figure separation, subfigure modality classification, and caption generation.
To facilitate research and development in this field, the Image Cross-Language Evaluation Forum (ImageCLEF) has run the medical task since 2004. The subtask of compound figure detection was first introduced in ImageCLEF2015 [5] and continued in ImageCLEF2016 [6]. The goal of this subtask is to identify whether a figure is a compound figure or not. The subtask provides training and test data containing compound and non-compound figures, together with their captions, from the biomedical literature.
This recognition task involves three run types: textual, visual and mixed. Characteristic Delimiters [9] and Bag-of-Words (BoW) are used to extract textual features. With respect to visual methods, most researchers focus on low-level features such as Border Profile [9][10][11], Bag-of-Keypoints (BoK) [12], Bag-of-Colors (BoC) [13], and SURFContext [14]. The best results are obtained by a combination of cross-media predictions [5,6]. Although they achieve good performance, the hand-designed features mentioned above depend on the choice of features and demand a clear awareness of prior domain knowledge. Hence, it is hard for them to capture the large number of possible input variations well.
Recently, deep neural networks have attained remarkable achievements not only in computer vision (CV) but also in natural language processing (NLP). Convolutional neural networks (CNNs) [15] have led to a series of breakthroughs in image classification [16][17][18]. Within NLP, most work with deep learning models, such as CNNs for sentence classification [19,20], long short-term memory (LSTM) networks [21] for sentiment classification [22,23], and gated recurrent unit (GRU) networks [24] for sentiment classification [23], involves learning word vector representations [25] and performing compositions over the learned word vectors for classification.
Our group, DUTIR (Information Retrieval Laboratory of Dalian University of Technology), took part in the compound figure detection subtask of ImageCLEF2016 and achieved good performance [6]. However, our textual runs based on a CNN or a recurrent neural network (RNN) did not reach state-of-the-art accuracy in the textual run type. In this work, we employ several different neural networks and find, without much surprise, that model combination performs better than any individual technique. We train several CNN, LSTM, and GRU networks to learn features from captions over pre-trained word vectors and make textual predictions. For visual prediction, we again feed the original figures to a visual model of multiple deep CNNs. Finally, we combine the two types of results to identify compound figures.
We test our mixed models on the datasets of both ImageCLEF2015 and ImageCLEF2016 and obtain better accuracies of 88.07% and 96.24%, compared to 85.39% in ImageCLEF2015 [5] and 92.7% in ImageCLEF2016 [6]. We also evaluate rule-based neural networks on text and on images separately and achieve good performance in both textual and visual prediction. However, our rule-based mixed model does not improve performance further. These results show that our cross-media framework can effectively capture the "rule" information from a compound figure and its caption.

Methods
This section describes the architecture for assembling neural networks (NNs) (see Figure 2). Our system contains one cross-media mixed model without rules (Mixed Model) and two single-media models based on rules (Model of Delimiter and Model of Border). The Model of Delimiter is a textual rule model that combines Delimiter Features with three deep learning methods: textual convolutional neural networks (TCNNs), textual long short-term memory networks (TLSTMs) and textual gated recurrent unit networks (TGRUs). The Model of Border is composed of Border Features and visual convolutional neural networks (VCNNs).
For ImageCLEF2015 or ImageCLEF2016, we obtain a word vector vocabulary trained on 6.8 million words collected from all captions, using the word2vec tool created by Mikolov et al. [25], which provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. The vectors have a dimensionality of 300. Words not present in the set of pre-trained words are initialized randomly.
Each sentence is clipped to a window of 300 words to reduce the number of parameters. The maximum sentence length is therefore 300: longer sentences are truncated, and shorter sentences are padded with zeros at the end.
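The fixed-length step described above can be sketched as follows. This is a minimal illustration, assuming captions have already been mapped to integer word ids and that truncation keeps the first 300 ids; the text does not say which end is cut, and the name `pad_or_truncate` is hypothetical.

```python
from typing import List

MAX_LEN = 300  # maximum sentence length stated in the text
PAD_ID = 0     # zeros are used for padding, as stated in the text


def pad_or_truncate(token_ids: List[int], max_len: int = MAX_LEN) -> List[int]:
    """Return exactly max_len ids: cut long captions, zero-pad short ones at the end."""
    # Assumption: truncation keeps the beginning of the caption.
    clipped = token_ids[:max_len]
    return clipped + [PAD_ID] * (max_len - len(clipped))


short = pad_or_truncate([5, 9, 2])            # 3 ids followed by 297 zeros
long = pad_or_truncate(list(range(1, 401)))   # ids 1..300 are kept, 301..400 dropped
```

Every caption then has the same length, so a batch of 32 captions forms a 32 × 300 integer matrix for the embedding layer.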
Using binary cross entropy as the loss function, we separately train three textual neural network models, TCNNs, TLSTMs, and TGRUs (see Figure 2), on top of the pre-trained word vectors. We set the word2vec vectors as the weights of the embedding layer and use the Glorot uniform initializer [27] to initialize the training weights of our neural networks. After gathering the outputs of the three textual NN models, we feed them to the next step, which is either the cross-media combination (discussed in a later subsection) or a textual combination based on Delimiter Features.

Textual Convolutional Neural Networks
The model of TCNNs, shown in Figure 2, is similar to the CNN architecture of [19,28]. It consists of an embedding layer with a dropout of 0.25; a convolutional layer with a kernel size of 3 and 250 feature maps; a max-over-time pooling layer with a max-pool size of 2; a vanilla hidden layer with 250 neurons, a dropout of 0.25, and a ReLU (Rectified Linear Unit) activation function; and a fully connected layer with 2 neurons. The softmax function is applied at the final layer to output the prediction probabilities of the two classes. After training multiple (e.g., 5) networks with the RMSProp optimizer [29] over shuffled mini-batches of 32, we take the average of their predictions as the output of the model.
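The averaging of several identically structured networks can be illustrated with a small sketch. The probability values below are made up for illustration, the class order [NOCOMP, COMP] is an assumption, and `average_predictions` is a hypothetical helper rather than the authors' code.

```python
from typing import List, Sequence


def average_predictions(per_network: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors that several networks give for one sample."""
    n_networks = len(per_network)
    n_classes = len(per_network[0])
    return [sum(p[c] for p in per_network) / n_networks for c in range(n_classes)]


# Hypothetical softmax outputs [P(NOCOMP), P(COMP)] of five networks for one caption.
outputs = [[0.40, 0.60], [0.55, 0.45], [0.30, 0.70], [0.35, 0.65], [0.45, 0.55]]
mean_probs = average_predictions(outputs)
label = max(range(len(mean_probs)), key=mean_probs.__getitem__)  # class with the larger mean
```

In this example one of the five networks prefers NOCOMP, but the averaged probabilities still favour COMP, which is the kind of smoothing that model averaging is meant to provide.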

Textual Long Short-Term Memory Network
The model of TLSTMs, shown in Figure 2, applies dropout within the LSTM layer, similar to [23,30]. The architecture contains an embedding layer with a dropout of 0.5, an LSTM layer with a dropout of 0.5, a fully connected layer with 2 neurons, and a softmax layer. We train multiple (e.g., 5) networks with Adam (Adaptive Moment Estimation) [31] over shuffled mini-batches of 32 and make the prediction by averaging all results.

Textual Gated Recurrent Unit Network
The model of TGRUs, shown in Figure 2, also applies dropout within the GRU layer, similar to [23,24]. The model consists of an embedding layer with a dropout of 0.5, a GRU layer with a dropout of 0.5, a fully connected layer with 2 neurons, and a softmax layer. After training multiple networks with Adam [31] over shuffled mini-batches of 32, we again combine them by averaging all results.

Textual Rule Model of Delimiter
When captions of compound figures are written, the subfigures are most likely addressed using some delimiters. Therefore, we extract three-dimensional Characteristic Delimiter features [9] from the caption of the input figure. The delimiters with the highest occurrence in the captions are referred to as "Characteristic Delimiters." By analyzing the captions in the training set, we select three delimiter pairs, "a, b", "a), b)" and "(a), (b)", and compute a feature vector by detecting whether each pair occurs.
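A three-dimensional delimiter feature of this kind might be computed as in the sketch below. The matching rules are assumptions, since the text names the three delimiter styles but not how they are detected; here matching is case-insensitive, and the function name `delimiter_features` is hypothetical.

```python
import re
from typing import List


def delimiter_features(caption: str) -> List[int]:
    """Return [has "a, b", has "a), b)", has "(a), (b)"] as 0/1 flags."""
    text = caption.lower()  # assumption: "(A)" and "(a)" are treated alike
    # Style "a, b": the two letters listed together, e.g. "panels a, b".
    plain = re.search(r"\ba,\s*b\b", text) is not None
    # Style "a), b)": each letter followed by ")" but not preceded by "(" or a word character.
    half = all(re.search(r"(?<![(\w])" + letter + r"\)", text) for letter in "ab")
    # Style "(a), (b)": each letter fully parenthesised.
    full = all(re.search(r"\(" + letter + r"\)", text) for letter in "ab")
    return [int(plain), int(half), int(full)]


f_full = delimiter_features("(A) Axial CT scan. (B) Coronal view.")
f_half = delimiter_features("a) MRI before treatment; b) MRI after treatment")
f_plain = delimiter_features("Panels a, b show the lesion.")
f_none = delimiter_features("Chest radiograph of a patient.")
```

Each caption yields one short binary vector, which is all the Logistic Regression classifier in the next step receives.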
Then we feed these features to a Logistic Regression (LR) classifier, which outputs the prediction probability of a compound figure. After that, we combine this output with the NN models by giving the LR classifier the deciding vote on positive predictions. Specifically, for each sample, we take the output of the LR classifier as the prediction when it makes a positive prediction, and otherwise use the output of the NN models. The new outputs of these rule-based models are taken as the textual predictions of the input samples.
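The deciding-vote rule described above reduces to a single conditional. The sketch below assumes that both models emit the probability of the COMP class and that "positive" means a probability of at least 0.5; the threshold and the name `rule_vote` are assumptions made for illustration.

```python
def rule_vote(lr_prob_comp: float, nn_prob_comp: float, threshold: float = 0.5) -> float:
    """Use the LR output when LR predicts COMP; otherwise fall back to the NN output."""
    if lr_prob_comp >= threshold:  # LR makes a positive prediction, so it decides
        return lr_prob_comp
    return nn_prob_comp            # LR is negative, so the NN output is kept


lr_decides = rule_vote(lr_prob_comp=0.9, nn_prob_comp=0.3)  # LR is positive
nn_decides = rule_vote(lr_prob_comp=0.2, nn_prob_comp=0.7)  # LR is negative
```

The rule is asymmetric by design: a negative LR output never overrides the network, which matches a rule with high precision on its positive predictions.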

Visual Convolutional Neural Networks
Before inputting the figures into VCNNs, shown in Figure 2, we resize them to a square of N × N pixels (where N = 32, 64, 128, etc.) and prepare a Python version of the dataset similar to the CIFAR-10 Python dataset.
The model of VCNNs is a deep CNN similar to [28,32,33]. The first two convolutional layers contain 32 kernels of size 3 × 3, and the second two convolutional layers have 32 kernels of size 3 × 3. The second and fourth convolutional layers are each followed by a pooling layer of dimension 2 × 2 with a dropout of 0.25. Then, a fully connected layer with 512 neurons and a dropout of 0.5 is followed by a fully connected layer with 2 neurons. The ReLU activation function is applied to all four convolutional layers and the first fully connected layer. The softmax function is applied at the final layer to output the prediction probabilities of the two classes. We use Glorot uniform to initialize the training weights and train the model using stochastic gradient descent (SGD) over shuffled mini-batches of 32.
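To see how the input size N affects the network, the sketch below walks the spatial size through the four convolutions and two pooling layers and reports how many values reach the 512-neuron layer. The padding mode and stride are not stated in the text, so stride 1 is assumed and both common padding modes are shown; the function names are hypothetical.

```python
def conv_out(size: int, kernel: int, padding: str) -> int:
    """Spatial size after a stride-1 convolution with 'same' or 'valid' padding."""
    return size if padding == "same" else size - kernel + 1


def vcnn_flat_features(n: int, padding: str = "valid", filters: int = 32) -> int:
    """Number of values fed to the first fully connected layer for an n x n input."""
    size = n
    for _ in range(2):                     # two blocks of conv, conv, pool
        size = conv_out(size, 3, padding)  # 3 x 3 convolution
        size = conv_out(size, 3, padding)  # 3 x 3 convolution
        size //= 2                         # 2 x 2 max pooling
    return size * size * filters           # 32 feature maps, as stated in the text


flat_valid = vcnn_flat_features(64, "valid")  # 64 -> 62 -> 60 -> 30 -> 28 -> 26 -> 13
flat_same = vcnn_flat_features(64, "same")    # 64 -> 64 -> 64 -> 32 -> 32 -> 32 -> 16
```

Under either assumption the flattened vector grows roughly with N squared, which is one reason the choice of N trades accuracy against training cost.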

Visual Rule Model of Border
A highly distinguishing feature of a compound figure is the existence of a separating border. We extract four-dimensional Border Features similar to [9] from input figures to describe the presence of horizontal or vertical, and black or white, borders.
First, we resize the figure to a square of 256 × 256 pixels and detect the presence of borders. We choose a strict detection range of [80,170] in the two directions to attain greater precision. Then we feed these features to a Logistic Regression classifier with the same parameters as in the textual rules of the Model of Delimiter and combine the results with the NN model in the same way as in the Model of Delimiter. The new outputs of this model are taken as the visual predictions of the input samples.
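One possible reading of the four Border Features is sketched below. The detection criterion is an assumption: a row or column inside the range [80,170] counts as a border when every pixel along it is nearly white or nearly black. The tolerance of 5 grey levels, the feature order and the name `border_features` are all hypothetical, and the image is taken to be a 256 × 256 grayscale array stored as nested lists.

```python
from typing import List

Image = List[List[int]]  # grayscale rows with values 0..255


def _uniform(line: List[int], white: bool, tol: int = 5) -> bool:
    """True if every pixel on the line is near white (or near black)."""
    if white:
        return all(v >= 255 - tol for v in line)
    return all(v <= tol for v in line)


def border_features(img: Image, lo: int = 80, hi: int = 170) -> List[int]:
    """Return [horizontal white, horizontal black, vertical white, vertical black] flags."""
    rows = [img[r] for r in range(lo, hi + 1)]                  # candidate horizontal lines
    cols = [[row[c] for row in img] for c in range(lo, hi + 1)]  # candidate vertical lines
    return [
        int(any(_uniform(r, True) for r in rows)),
        int(any(_uniform(r, False) for r in rows)),
        int(any(_uniform(c, True) for c in cols)),
        int(any(_uniform(c, False) for c in cols)),
    ]


# Synthetic mid-grey image with a white vertical stripe at column 128.
demo = [[255 if c == 128 else 100 for c in range(256)] for _ in range(256)]
demo_features = border_features(demo)
```

Restricting the search to the central range ignores frames at the image edge, which fits the stated aim of favouring precision over recall.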

Mixed Method
The predictions mentioned above are combined using the same averaging strategy, in which y is the predicted class label, the function σ(•) returns the mean of the predicted probabilities of the k models, and the function argmax(•) returns the input x at which that mean is maximal.
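One way to write this averaging rule that is consistent with the symbol definitions above is given here. The symbol p_i(x), the probability that model i assigns to class x, is notation introduced for this sketch and is not taken from the text.

```latex
y = \operatorname*{argmax}_{x} \; \sigma\bigl(p_1(x), \dots, p_k(x)\bigr),
\qquad
\sigma\bigl(p_1(x), \dots, p_k(x)\bigr) = \frac{1}{k} \sum_{i=1}^{k} p_i(x)
```

For the two-class task, x ranges over COMP and NOCOMP, so y is simply the class with the larger mean probability.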
Our Mixed Model: after combining each of the three textual models with the visual model separately, we fuse the three cross-media combinations without rules by applying the averaging strategy again to make the final mixed prediction for the current sample.
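The two-level fusion of the Mixed Model can be sketched as follows: each textual model is first averaged with the visual model, and the three resulting cross-media predictions are then averaged again. The probability values are invented for illustration, the class order [NOCOMP, COMP] is an assumption, and the helper names are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def avg(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of several class-probability vectors."""
    return [mean(column) for column in zip(*vectors)]


def mixed_prediction(textual: Dict[str, List[float]], visual: List[float]) -> int:
    """Average each textual model with the visual model, then average the combinations."""
    cross_media = [avg([probs, visual]) for probs in textual.values()]
    fused = avg(cross_media)
    return max(range(len(fused)), key=fused.__getitem__)  # index of the winning class


# Hypothetical [P(NOCOMP), P(COMP)] outputs for one figure and its caption.
textual = {"TCNN": [0.60, 0.40], "TLSTM": [0.70, 0.30], "TGRU": [0.55, 0.45]}
visual = [0.20, 0.80]
label = mixed_prediction(textual, visual)
```

Because the visual prediction enters every cross-media pair, it carries as much weight in the final average as the three textual models together.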

Experiments
In this section, we describe the baseline models, which achieved the highest accuracies in the Compound Figure Detection task of ImageCLEF2015 and ImageCLEF2016, for comparison with our proposed models. Then, we present the experimental results of our approaches alongside the baselines.

Dataset
For our experiments, we use the ImageCLEF2015 and ImageCLEF2016 Compound Figure Detection datasets [5,6], which are drawn from a subset of PubMed Central. The task provides training and test data containing compound and non-compound figures from the biomedical literature. ImageCLEF provides figures and associated captions, each pair of which is labeled as a compound figure (COMP) or a non-compound figure (NOCOMP). In ImageCLEF2015 [5], the training set contains 10,433 figures and the test set 10,434 figures. Each of these two sets contains 6144 compound figures.
In ImageCLEF2016 [6], the training set is expanded to 20,997 figures and the test set is reduced to 3456 figures. These two sets contain 12,348 and 1806 compound figures, respectively.
In accordance with the evaluation criterion of the benchmark, we evaluate our approach by two-class (COMP and NOCOMP) classification accuracy for all experiments unless otherwise stated. After training the networks with 10-fold cross validation (10FCV) on the training set, we test our trained models on the test set. Much of the code is adapted from our previous work [28,32,33] and is implemented with the Keras neural network library, running on top of TensorFlow. All default parameters are used, except for those mentioned in the Methods section. Our networks are trained on one NVIDIA Tesla K20c GPU in a 64-bit Dell computer (Dalian, China) with two 2.40 GHz CPUs, 64 GB of main memory, and Ubuntu 12.04.
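The 10-fold cross-validation protocol mentioned above can be sketched with a plain index split. The text does not say how the folds were drawn or how the fold models were used afterwards, so the shuffling, the seed and the function name `kfold_indices` are assumptions.

```python
import random
from typing import Iterator, List, Tuple


def kfold_indices(n_samples: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)        # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]   # k nearly equal folds
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


# 20,997 is the ImageCLEF2016 training-set size given in the text.
splits = list(kfold_indices(20997))
```

Each sample is held out exactly once, so every fold trains on roughly 90% of the training set and validates on the remaining 10%.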

Baselines
This section describes the baseline methods and their results in both ImageCLEF2015 and ImageCLEF2016.

ImageCLEF2015
Pelka et al. [9] obtained the best textual result with an accuracy of 78.34% in ImageCLEF2015, labeled as Baseline_Text (see Table 1). For textual features, they used a Bag-of-Words (BoW) approach on the provided figure captions. They also detected the presence of delimiters characterizing compound figures, which were manually selected by analyzing the associated captions in the training set. After concatenating the BoW features and the Characteristic Delimiter features, they fed them to a random forest (RF) classifier.
Table 1 notes: (1) "82.24 ± 0.16" means an average accuracy of 82.24 with a standard deviation of 0.16. (2) "Models" contains three textual NN models ("TCNN", "TLSTM", and "TGRU"), three rule-based textual NN models ("Rule_TCNN", "Rule_TLSTM", and "Rule_TGRU"), two kinds of combination, and a rule-based Model of Delimiter, in addition to the baseline model "Baseline_Text". The suffix "1" or "5" indicates the number of networks trained.
Wang et al. [10] obtained the best visual result with an accuracy of 82.82% in ImageCLEF2015, labeled as Baseline_Figure (see Table 2). They fused two different schemes to identify compound figures. The connected component analysis-based scheme mainly addressed the presence of connected text in compound figures using special ratio criteria. The peak region detection-based scheme leveraged pixel intensity to find borders.
Table 2 note: "Models" contains the visual CNN models ("VCNN1" and "VCNN5") and one visual rule model based on Border Features and CNNs ("Model of Border"), in addition to the baseline model "Baseline_Figure". The suffix "1" or "5" indicates the number of networks trained.
Pelka et al. [8] achieved the best result using a multi-modal approach, with an accuracy of 85.39% in ImageCLEF2015, labeled as Baseline_Mixed (see Table 3). They extracted Bag-of-Keypoints (BoK) features and Border Features from figures. After reducing the feature dimensions using principal component analysis, they concatenated the visual features and the textual features mentioned above and fed them to the random forest (RF) classifier.

ImageCLEF2016
In ImageCLEF2016, very good results were obtained for the compound figure detection task, reaching up to 92.7% for our team (DUTIR), labeled as Baseline_Mixed (see Table 3). Our team also obtained the best visual result with an accuracy of 92.01%, labeled as Baseline_Figure (see Table 2), by combining five deep convolutional neural networks. MLKD obtained the best textual result of 88.13%, labeled as Baseline_Text (see Table 1).

Textual Results
We use the following protocol for all textual experiments: a maximum input sentence length of 300, a mini-batch size of 32, and an epoch number of 6. We choose these values via a grid search on the ImageCLEF2015 dataset.
From Table 1, we can see that every type of neural network (NN) model (TCNN1, TLSTM1, and TGRU1) performs better than the textual baseline in both ImageCLEF2015 and ImageCLEF2016. Inspired by the work of [28,32,33], we train five networks with the same structure for each of the three textual models and combine them separately (TCNN5, TLSTM5, and TGRU5, see Table 1) to improve performance further. For example, among the three types of textual models, the textual convolutional neural network (TCNN5) achieves 82.30% in ImageCLEF2015 and 89.81% in ImageCLEF2016, both higher than the textual baselines of 78.34% and 88.13% (Baseline_Text).
After feeding the Characteristic Delimiter features to an LR classifier with an inverse regularization strength of 1 × 10^5, implemented with Scikit-Learn, we obtain lower accuracies of 81.42% and 83.91% (see Table 1). However, further observation shows that the Model of Delimiter has higher precision than the other NN models (see Table 4) in both 2015 and 2016. It also surpasses the accuracy of the NN models on the subset of samples that the LR classifier trained on Delimiter Features identifies as positive (see Table 4). Because of this advantage, all NN models improve their performance when combined with the textual rule (see Table 1). For example, the accuracy of the model of five rule-based textual gated recurrent unit networks (Rule_TGRU5) increases by more than 0.80 percentage points (from 82.40% to 83.24% in 2015 and from 88.72% to 89.53% in 2016). Finally, we combine all three rule-based textual NN models to construct the Model of Delimiter to identify compound figures and obtain state-of-the-art textual accuracies of 83.24% in ImageCLEF2015 and 90.25% in ImageCLEF2016.

Visual Results
We use the following protocol for the experiments with the visual neural network model: a mini-batch size of 32, an epoch number of 30, and an image size of 64. As with our textual models, we train five deep CNNs with the same structure on the figures and combine their results to identify compound figures. This combination increases accuracy by 3.41 percentage points in 2015 (from 80.83% to 84.24%) and by 2.34 percentage points in 2016 (from 89.99% to 92.33%) (see Table 2).
We find a similar advantage when observing the performance of the visual models. On the subset of samples recognized as positive by the LR classifier trained on Border Features, that classifier has a higher accuracy than the NN model (92.86% versus 90.76% in 2015 and 95.58% versus 95.32% in 2016) (see Table 5). These results explain why the rule-based neural networks (Model of Border) attain a better visual accuracy of 86.28% in 2015 and 93.66% in 2016 (see Table 2).

Mixed Results
We combine the three types of textual neural networks with the visual networks separately and compare their prediction probabilities. The results show that assembling all combinations achieves more stable performance than any single one (see Table 3). Our system obtains better accuracies of 88.07% in 2015 and 96.24% in 2016.
We also test combining rule-based textual and visual neural networks to identify compound figures. From Table 3, we find that introducing rule information into our cross-media combination does not effectively improve prediction performance. Combining all three rule-based mixed models even harms performance (from 88.07% to 87.52% in 2015 and from 96.24% to 96.18% in 2016). As in the single-media analysis, we create a subset whose samples are identified as positive by Logistic Regression trained on Delimiter or Border Features and compare the performance of the rule models with the three cross-media combinations without rules (see Tables 6 and 7). We further explore the results to find why performance drops after rules are included in the system. On the one hand, Tables 4 and 5 show that the rule methods have very poor recall (for the visual methods, for example, 56.94% in 2015 and 59.91% in 2016), although their precision is better than that of the single-media neural network methods. On the other hand, our cross-media methods have better precision than the rule-based LR classifiers (see Tables 6 and 7). These results show that our cross-media combinations capture the compound figure information from both figures and captions well.
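The precision and recall contrast discussed above can be made concrete with a small sketch. The labels below are invented to mimic a cautious rule that rarely fires; they are not data from the experiments, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Compute precision, recall and accuracy for binary labels (1 = COMP, 0 = NOCOMP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# Made-up example: the rule fires once and is right, but misses three compound figures.
y_true = [1, 1, 1, 1, 0, 0]
y_rule = [1, 0, 0, 0, 0, 0]
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred=y_rule)
```

A rule of this kind is trustworthy when it says COMP but decides few cases, so once the cross-media models reach higher precision themselves, giving the rule the deciding vote has little left to contribute.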

Running Time
The training times of our networks are listed in Table 8. For comparison, we report the running time for training or testing on one sample, excluding data preprocessing. For deep learning models, we record the training time for one epoch. From Table 8, we find that the traditional feature-engineering method is much faster than the neural network methods. In our experiments, CNNs tend to be much faster than RNNs.

Conclusions
We have presented a system for medical compound figure identification that is composed of two independent parts, the visual and the textual, which are combined by averaging the prediction probabilities. Our system achieves promising performance in the visual, textual, and mixed run types of the ImageCLEF2015 and ImageCLEF2016 compound figure detection tasks. We hope to include new techniques in our system and to focus on improving the state of the art for this task.

Figure 2. Architecture of assembling deep neural networks for compound figure detection.

Table 1. Accuracy of textual methods.

Table 2. Accuracy of visual methods.

Table 3. Accuracy of mixed methods.

Table 4. Comparison of performance between the textual NN methods and Delimiter Features.

Table 5. Comparison of performance between the neural network method and Border Features. "Accuracy" in this table refers to the identification success rate on the subset of samples that are identified as positive by Logistic Regression trained on Border Features.

Table 6. Comparison of performance between mixed methods and Delimiter method.

Table 7. Comparison of performance between mixed methods and Border method.

Table 8. Training and test time of our neural networks.