Channel and Spatial Attention Regression Network for Cup-to-Disc Ratio Estimation

Cup-to-disc ratio (CDR) is of great importance in assessing structural changes at the optic nerve head (ONH) and in diagnosing glaucoma. While most efforts have been put into acquiring the CDR through CNN-based segmentation algorithms followed by calculation of the CDR, these methods usually focus only on the features within the convolution kernel, which is, after all, an operation on a local region, ignoring the contribution of rich global features (such as distant pixels) to the current features. In this paper, a new end-to-end channel and spatial attention regression deep learning network is proposed that deduces the CDR number from the regression perspective and combines the self-attention mechanism with the regression network. Our network consists of four modules: the feature extraction module to extract deep features expressing the complicated pattern of the optic disc (OD) and optic cup (OC), the attention module including the channel attention block (CAB) and the spatial attention block (SAB) to improve feature representation by aggregating long-range contextual information, the regression module to deduce the CDR number directly, and the segmentation-auxiliary module to focus the model's attention on the relevant features instead of the background region. Specifically, the CAB selects relatively important feature maps in the channel dimension, shifting the emphasis to the OD and OC region; meanwhile, the SAB learns the discriminative ability of the feature representation at the pixel level by capturing relationships within a feature map. The experimental results on the ORIGA dataset show that our method obtains an absolute CDR error of 0.067 and a Pearson's correlation coefficient of 0.694 in estimating the CDR, and that our method has great potential in predicting the CDR number.


Introduction
As the second leading cause of blindness, glaucoma is a disease that causes damage to the optic nerve of the eyes resulting in deteriorated vision [1]. Once diagnosed, the disease cannot be treated completely, but timely detection can further control the effect of glaucoma. Therefore, early detection and treatment are essential for glaucoma patients to safeguard their vision [2][3][4].
Various diagnosis parameters for glaucoma have been proposed, such as the CDR, the ISNT rule, DDLS, and the glaucoma risk index (GRI) [5], which are used for assessing structural changes at the optic nerve head (ONH) and for diagnosing glaucoma. The CDR is widely regarded as one of the crucial indications of glaucoma. The main contributions of this paper are as follows:
• The proposed channel attention block selects relatively important feature maps in the channel dimension, shifting the emphasis to the feature maps that are closely related to the optic disc and cup region, and the proposed spatial attention block captures relationships within a feature map to improve the discriminative ability of the feature representation at the pixel level.
• We design a segmentation-auxiliary task to help the regression task focus on the optic disc and cup region instead of the non-optic disc and cup region.
Figure 1. Overview of the Channel and Spatial Attention Regression Network, which combines a deep convolutional neural network (CNN) for cropping the optic disc and optic cup areas, the encoder, two parallel attention modules called the channel attention block (CAB) and the spatial attention block (SAB), the decoder, and a multitask relationship learning module for cup-to-disc ratio estimation. Here, "conv" denotes a convolutional layer, "down_rep" represents three down-sampling layers, and "conv-64" means a convolutional layer with kernel size 64 × 64.
The remainder of the paper is organized as follows. In Section 2, we briefly review some methods for calculating the CDR number and some networks that use the attention mechanism. In Section 3, the architecture of the proposed channel and spatial attention regression (CSAR) network is presented. We then give the dataset, experimental details, results, and discussion in Section 4. Section 5 concludes the paper.

Related Work
In this section, some methods of calculation of CDR number are briefly reviewed, and we also present some networks that use attention mechanism.

Existing Two Methods of the Calculation of CDR
Automatic glaucoma diagnosis algorithms have gained increasing recognition, for example in computer-aided diagnosis systems [7,13-16]. Some glaucoma detection methods compute the CDR from spectral-domain optical coherence tomography (OCT) images [17-23], and others from fundus images. Existing solutions for calculating the CDR fall into two categories: segmentation methods and direct estimation methods.

• Segmentation methods. Most researchers focus their efforts on segmentation methods, part of which segment the OD and OC independently. For OD segmentation alone, hand-crafted features are inevitably required, such as image gradient information extracted by an active contour model [24], local texture features [25], disparity features extracted from stereo images [26], and a novel polar transform method [27]. OC segmentation is also highly dependent on hand-crafted visual features. However, because the OD and OC have a certain structural similarity and positional correlation, joint OD and OC segmentation approaches obtain better performance [28,29]. Zheng et al. [28] jointly segment the OD and OC by leveraging a graph-cut mechanism. A superpixel-level classifier [29] is utilized to provide robustness for segmenting the OD and OC. Recently, deep learning techniques have achieved excellent performance in computer vision and are also widely used for joint OD and OC segmentation. Sevastopolsky et al. [30] design a modification of the U-net convolutional neural network (CNN), but it still operates in two stages. Furthermore, based on U-net, Qin et al. [31] combine deformable convolution and create a novel architecture for segmentation of the OD and OC. Subsequently, the residual network (ResNet) is introduced, and whether generative adversarial networks (GANs) are helpful for OD and OC segmentation is discussed in [32]. M-net [12] develops a one-stage multi-scale mechanism and adopts a polar transformation to shift the fundus images to the polar coordinate system. However, finding the center point of the OD is inevitably required, which increases the workload to some extent. Unsupervised domain adaptation for joint OD and OC segmentation over different retinal fundus image datasets is exploited in [33]. The work in [34] deals with the OD and OC by combining a GAN. The segmentation problem is addressed as an object detection problem in [35].
• Direct estimation methods. Direct methods usually deduce the CDR numbers directly without segmenting the OD and OC. The existing machine learning method has two separate stages: unsupervised feature representation with a CNN and CDR number regression by a random forest regressor [36].

Existing Attention Model
It has been shown that the attention mechanism can be successfully adopted in CNNs, significantly boosting the performance of many vision tasks [37-40]. Self-attention [41] was first proposed and applied in the domain of natural language processing (NLP). Recently, it has also gained attention in the domain of computer vision [42-45]. The essence of the self-attention mechanism is to emphasize or select important information of target objects and suppress irrelevant details through a series of attentional distribution coefficients, namely weight coefficients. The attention mechanism, especially the self-attention mechanism, can flexibly capture the connection between local and global information in one step, improving the model's representation ability. Moreover, a small and light structure is another advantage of the attention mechanism. In particular, the non-local network [42] computes the response at a position as a weighted sum of the features at all positions. Based on the covariance matrix of the non-local mechanism, Du et al. [43] design a new self-attention mechanism inspired by PCA to generate attention maps, achieving better interaction. Woo et al. [44] put forward a spatial attention mechanism to distinguish the importance of different positions. To the best of our knowledge, few methods combine an attention model with CNNs for glaucoma detection. Only one method [45] introduces the ophthalmologist's attention map into AG-CNN to remove the redundancy of the fundus image; however, a human attention map is inevitably required.

Methodology
In this section, we first present an overview of our CSAR model, then the architecture of two attention blocks is introduced. Next, we describe how to aggregate the segmentation-auxiliary module for the regression task, and the train loss is presented in the end.

Overview
Our model is based on the traditional U-net, and its general architecture is as follows. Firstly, the encoder module contains four encoder blocks, with a residual network block employed as the backbone of each block. After that, the features enter the attention module: the channel attention block (CAB) and the spatial attention block (SAB). The structure of the self-attention mechanism is shown in Figures 2 and 3. The purpose of the CAB is to acquire the connections between different channels automatically. The parallel SAB acts on connections between pixels, weighting different regions in one feature map so that the regression model focuses on the relevant feature areas and highlights salient regions. The proposed attention module independently allocates weights within and between feature maps, and these weights are learned through backpropagation. Then, the decoder module symmetrically expands the path. Finally, the feature maps flow into the segmentation-auxiliary module and the regression module respectively, and the segmentation-auxiliary task transmits the label information back to guide the feature extraction. Similarly, the regression output is obtained by a CNN. We note that our regression branch has a straightforward structure.

The Spatial Attention Block
From the global and local perspectives, images often follow different patterns of change. The cup and disc areas and their shapes are the focus of our attention. However, characteristics of the background area (such as blood vessels traversing the cup boundary) tend to interfere with the foreground (the OD and OC area); at the same time, the OC lies inside the OD, which is essential position information that cannot be ignored.
In order to realize our observation, we propose the spatial attention block (SAB). The architecture of the proposed SAB is shown in Figure 2, and its function is to learn the discriminative ability of the feature representation at the pixel level by capturing relationships within a feature map. The process can be divided into three parts: first, a weight matrix is generated according to the similarity between the pixels in the feature map; second, the weight matrix is multiplied with the original feature matrix; third, the resulting matrix is added to the input features. To be specific, the input features I are simultaneously fed into three convolution layers, yielding three new feature maps E1, E2, and E3, which are reshaped to C × N with N = H × W. A matrix multiplication and a SoftMax layer are then applied to the transpose of E2 and E3, and the attention map is acquired as:

$$S_{ij}^{m}=\frac{\exp\left(E_{2}^{i}\cdot E_{3}^{j}\right)}{\sum_{i=1}^{N}\exp\left(E_{2}^{i}\cdot E_{3}^{j}\right)}$$

where S_ij^m marks the ith position's impact on the jth position. The feature at a particular position in a feature map is refreshed by aggregating the features at all pixels with a weighted summation [46]. In short, any two similar features can enhance each other's expression: features with many similar counterparts (such as the cup and disc) are enhanced more, while features with few similar counterparts (such as blood vessels and background) are enhanced less.
After that, we multiply the attention map with the reshaped E1, and then add the input feature I to redistribute the correlation information onto the original feature map:

$$I_{final}^{j}=\lambda_{s}\sum_{i=1}^{N}\left(S_{ij}^{m}E_{1}^{i}\right)+I^{j}$$

where λ_s is a learnable scaling parameter implemented as a convolutional layer with kernel size 1 × 1. Through the above calculation, the intra-feature-map attention can capture the correlation between global information and location information and strengthen closely similar features, thus addressing the two problems raised above.
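The SAB described above can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' released code: the channel-reduction factor for E2/E3 and realising λ_s as a single learnable scalar (initialised to zero) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionBlock(nn.Module):
    """Spatial self-attention: each pixel aggregates features from all
    pixels, weighted by pairwise similarity, then a residual is added."""
    def __init__(self, channels):
        super().__init__()
        # E1 keeps the full depth; E2/E3 are reduced for a cheaper affinity matrix
        self.e1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.e2 = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.e3 = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.lambda_s = nn.Parameter(torch.zeros(1))  # learnable scale, starts at 0

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        e1 = self.e1(x).view(b, c, n)               # B x C  x N
        e2 = self.e2(x).view(b, -1, n)              # B x C' x N
        e3 = self.e3(x).view(b, -1, n)              # B x C' x N
        # s[:, i, j]: influence of position i on position j (softmax over i)
        s = F.softmax(torch.bmm(e2.transpose(1, 2), e3), dim=1)  # B x N x N
        out = torch.bmm(e1, s).view(b, c, h, w)     # weighted sum over all pixels
        return self.lambda_s * out + x              # residual connection
```

Because λ_s starts at zero, the block initially passes features through unchanged and gradually learns how much attention to mix in.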

The Channel Attention Block
Compared with the spatial attention block, which captures global information, the channel attention block selects relatively important feature maps, shifting the emphasis to the optic disc and cup region.
Since each feature map in the channel dimension can be regarded as a class-specific response [46], we exploit the interdependencies among channel maps to emphasize interdependent feature maps and improve the feature representation of specific semantics. In glaucoma analysis, there are usually several types of responses: OD, OC, blood vessels, and other background areas. Through observation and experiments, however, it can be found that the channel information of the optic disc and cup accounts for a relatively large proportion, with only a small amount of learned blood vessel information and other background areas. The channel-similarity weight matrix established by the CAB can effectively enhance the response of the cup and disc.
The design of the CAB in Figure 3 is similar to that of the SAB. Unlike the SAB, which feeds the input into three convolution layers, the CAB passes it through only one convolution layer, producing a feature map F. The weight matrix between channels is computed analogously:

$$C_{ij}^{m}=\frac{\exp\left(F^{i}\cdot F^{j}\right)}{\sum_{i=1}^{C}\exp\left(F^{i}\cdot F^{j}\right)}$$

where C_ij^m marks the ith channel's impact on the jth channel. The output C_final, with a size of C × H × W, is defined channel-wise as:

$$C_{final}^{j}=\lambda_{c}\sum_{i=1}^{C}\left(C_{ij}^{m}F^{i}\right)+F^{j}$$

where λ_c is a learnable scaling parameter implemented as a convolutional layer with kernel size 1 × 1.
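The CAB can be sketched analogously to the SAB. Again a sketch, not the authors' code: realising λ_c as a single learnable scalar is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionBlock(nn.Module):
    """Channel self-attention: each channel aggregates all channels,
    weighted by channel-wise similarity, then a residual is added."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # single conv, unlike SAB
        self.lambda_c = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        f = self.conv(x).view(b, c, -1)                          # B x C x N
        # cm[:, i, j]: influence of channel i on channel j (softmax over i)
        cm = F.softmax(torch.bmm(f, f.transpose(1, 2)), dim=1)   # B x C x C
        out = torch.bmm(cm.transpose(1, 2), f).view(b, c, h, w)  # weighted channel sum
        return self.lambda_c * out + x
```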

The Segmentation-Auxiliary Module
In general, the segmentation task and the regression task are carried out independently. However, combining the two tasks yields mutual promotion, for the following reasons:
• A direct regression deep learning network is more like a black box, and we cannot tell which features are mapped well in the regression; meanwhile, it is tough to select features that represent the boundary of the optic cup and disc. Adding the segmentation-auxiliary module gives the OD and OC features a specific, prominent enhancement in the feature selection of the regression map. The experimental results confirm this conclusion.
• Adding the segmentation-auxiliary module improves the convergence of the network.
The structure is shown in Figure 4: after the decoder module, the model enters the segmentation-auxiliary path and the regression path in parallel. The segmentation output is obtained by a convolution layer and a SoftMax layer. Similarly, the final feature map from the decoder is passed through a convolution layer, and the final CDR number is obtained after three down-sampling layers and a convolution layer with kernel size 64 × 64.
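The two output paths described above can be sketched as follows. The paper fixes only the 1 × 1 segmentation conv + SoftMax, the three down-sampling layers, and the final 64 × 64 conv; the intermediate channel counts and the use of strided convolutions for down-sampling are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class DualHead(nn.Module):
    """Segmentation-auxiliary head and regression head on the decoder output."""
    def __init__(self, in_ch=32, n_classes=3):
        super().__init__()
        # segmentation path: 1x1 conv + SoftMax over classes
        self.seg = nn.Sequential(nn.Conv2d(in_ch, n_classes, kernel_size=1),
                                 nn.Softmax(dim=1))

        def down(c_in, c_out):  # one down-sampling step (stride-2 conv assumed)
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                                 nn.ReLU())

        # regression path: conv, three down-sampling layers (512 -> 64),
        # then a 64x64 conv that collapses the map to a single CDR number
        self.reg = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            down(16, 16), down(16, 16), down(16, 16),
            nn.Conv2d(16, 1, kernel_size=64))

    def forward(self, x):
        return self.seg(x), self.reg(x).flatten(1)  # (masks, CDR per image)
```

For a 512 × 512 decoder output, the three stride-2 convolutions reduce the map to 64 × 64, so the final 64 × 64 kernel produces exactly one scalar per image.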
The operation mechanism of the proposed network is shown in Figure 4. Training involves four parts: the convolutional neural network for feature extraction, the dual-attention mechanism for enhancing the characteristics of the OD and OC, the regression network, and the segmentation branch that assists the regression optimization. At test time, the segmentation-auxiliary module is removed, and the result consists only of the regressed CDR number. Our model can thus provide a simple and convenient tool for doctors.
Figure 4. The segmentation-auxiliary task is introduced into our model in parallel with the main regression task. During testing, our model works in an implicit way in which image segmentation information is considered but not displayed.

Training Loss
Segmentation loss. In the segmentation branch, the Jaccard index (intersection over union) is used; it is the intersection of two sets divided by their union:

$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$

From the pixel perspective, the formula can be rewritten as:

$$J=\frac{\sum_{i}y_{i}\hat{y}_{i}}{\sum_{i}\left(y_{i}+\hat{y}_{i}-y_{i}\hat{y}_{i}\right)}$$

where y_i represents the label and ŷ_i represents the ith predicted pixel. The final L_s is expressed as:

$$L_{s}=H-\log J$$

where H represents the categorical cross entropy.
Regression loss. For the regression task, the mean square error between the predicted and target CDR values defines the regression loss L_r:

$$L_{r}=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{c}_{i}-c_{i}\right)^{2}$$

where ĉ_i is the predicted CDR and c_i is the target CDR of the ith sample.
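The segmentation loss above can be sketched with a soft (differentiable) Jaccard index; using softmax probabilities in place of hard predictions and the small epsilon are standard assumptions of this sketch, not details given in the paper.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(pred, target, eps=1e-7):
    """L_s = H - log(J): categorical cross entropy minus the log of a
    soft Jaccard index. `pred` holds raw logits (B x C x H x W),
    `target` holds class indices (B x H x W)."""
    ce = F.cross_entropy(pred, target)                       # H
    prob = F.softmax(pred, dim=1)
    onehot = F.one_hot(target, pred.shape[1]).permute(0, 3, 1, 2).float()
    inter = (prob * onehot).sum()
    union = prob.sum() + onehot.sum() - inter
    jaccard = (inter + eps) / (union + eps)                  # soft IoU in (0, 1]
    return ce - torch.log(jaccard)
```

Since log J ≤ 0, the -log J term is non-negative and pushes the predicted masks toward higher overlap with the labels.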
Total loss. In the experiment, we first try to simply add up the different losses. It soon becomes clear that although the segmentation task converges, the regression task performs poorly. Further study shows that the scales of the two task losses are different, so the overall loss is dominated by the first task. In order to balance the two tasks, we adopt the method in [47]. According to the two definitions above, and following [47], the joint loss is expressed as:

$$L\left(\Theta,\sigma_{1},\sigma_{2}\right)=\frac{1}{2\sigma_{1}^{2}}L_{r}(\Theta)+\frac{1}{2\sigma_{2}^{2}}L_{s}(\Theta)+\log\sigma_{1}+\log\sigma_{2}$$

where Θ denotes the network parameters, and σ_1 and σ_2 balance the weights of the two tasks and are learnable during training. In effect, the ultimate goal can be seen as learning the relative weight of each subtask output. In practice, to escape a potential division by zero, δ_k = log(σ_k) is redefined. Therefore, the final loss can be rewritten as:

$$L=\frac{1}{2}e^{-2\delta_{1}}L_{r}+\frac{1}{2}e^{-2\delta_{2}}L_{s}+\delta_{1}+\delta_{2}$$
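The δ-reparameterised joint loss can be sketched as a small module with two learnable log-scales (the zero initialisation, which starts both tasks at equal weight, is an assumption):

```python
import torch
import torch.nn as nn

class JointLoss(nn.Module):
    """Uncertainty-based task weighting: delta_k = log(sigma_k) are
    learnable, so neither task can dominate through loss scale alone."""
    def __init__(self):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(2))  # [delta_1 (reg), delta_2 (seg)]

    def forward(self, loss_r, loss_s):
        w = torch.exp(-2.0 * self.delta)           # 1 / sigma_k^2
        return 0.5 * w[0] * loss_r + 0.5 * w[1] * loss_s + self.delta.sum()
```

The δ_k terms act as a regulariser: shrinking a task's weight to silence its loss is penalised, so the network must trade off the two tasks rather than ignore one.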

Experiments and Analysis
The effectiveness of our CSAR is verified from different aspects. Firstly, an ablation study is used to test the performance of the two attention blocks, CAB and SAB, within the regression network. We simultaneously calculate the mean absolute error (MAE) and the correlation coefficient. Furthermore, the area under the curve (AUC) is computed to evaluate our method on glaucoma screening. Then, the attention maps are visualized to verify our conclusions more intuitively.
Secondly, to evaluate the performance of the segmentation-auxiliary module, we also conduct an ablation study. As described above, the experimental results with and without the segmentation-auxiliary module are compared from three aspects: correlation coefficient, MAE, and glaucoma screening accuracy. In addition, a visualization of the convolutional layers is presented to explain more clearly the problems we encounter in the experiment.
Thirdly, we compare our CSAR to other traditional methods and deep learning methods, such as R-Bend [25], ASM [48], Superpixel [29], M-net [12], and JointRCNN [35]. These experiments are again conducted on the ORIGA dataset with the same criteria. Finally, we also discuss and evaluate the ISNT rule.

Datasets and Configurations
The ORIGA dataset contains 650 fundus images with 482 normal eyes and 168 glaucomatous eyes. Set A, including 325 images, is used for training, and set B is used for testing [49]. In our experiment, we use the same dataset division as [12]. In order to detect the OD and OC in retinal fundus images at their original resolution, we crop a 512 × 512 area based on the OD localization approach proposed by [12]. Since the training set is small, we double the size of the dataset via contrast enhancement of the fundus images and horizontal flipping.
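The augmentation described above can be sketched as follows; the exact contrast-enhancement factor is not given in the paper, so the factor here is an illustrative assumption.

```python
import numpy as np

def augment(img: np.ndarray, factor: float = 1.5) -> np.ndarray:
    """Return a contrast-enhanced, horizontally flipped copy of an
    H x W x 3 fundus image (the enhancement factor is an assumption)."""
    mean = img.mean(axis=(0, 1), keepdims=True)           # per-channel mean
    enhanced = np.clip(mean + factor * (img - mean), 0, 255)
    return enhanced[:, ::-1, :].astype(img.dtype)         # flip the width axis
```

Appending the augmented copy of every training image to the original set doubles its size, matching the description above.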
In the experiment, our CSAR model is implemented in Python with the PyTorch framework. During training, stochastic gradient descent (SGD) is employed to optimize the CSAR model, and the initial learning rate is set to 0.0001. As the number of training epochs increases, the learning rate continues to decline.
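The training setup above can be sketched as follows; the paper fixes only SGD and the initial learning rate of 0.0001, so the step decay schedule (and omitting momentum) are assumptions of this sketch.

```python
import torch

model = torch.nn.Linear(4, 1)  # stand-in for the CSAR network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
# learning rate declines as epochs increase; the StepLR parameters
# are illustrative, the paper does not specify the decay rule
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

for epoch in range(3):
    # ... forward pass, loss computation, loss.backward() ...
    optimizer.step()
    scheduler.step()  # decay after each epoch
```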

Evaluation Criteria
In this work, diagnosis results obtained from experts are used as the gold standard for glaucoma screening, and the CDR numbers calculated from the experts' segmentations of the OD and OC are used as the ground truth for CDR estimation. We evaluate our model on the following three criteria:

Absolute CDR Error
We use the absolute CDR error δE as one of the evaluation metrics, defined as:

$$\delta E=\left|CDR_{p}-CDR_{h}\right|$$

where CDR_h represents the CDR from the trained experts, and CDR_p is the predicted CDR (for segmentation-based methods, it is calculated after the segmentation masks of the OD and OC are obtained).
We also use the mean absolute error (MAE), which is the mean of all samples' absolute errors:

$$MAE=\frac{1}{N}\sum_{i=1}^{N}\delta E_{i}$$

where N represents the number of test samples.

Pearson's Correlation Coefficient
In order to measure the degree of correlation between the predicted CDR and the hand-labeled CDR, Pearson's correlation coefficient is used, defined as:

$$r=\frac{\sum_{i=1}^{N}\left(CDR_{p}^{i}-\overline{CDR_{p}}\right)\left(CDR_{h}^{i}-\overline{CDR_{h}}\right)}{\sqrt{\sum_{i=1}^{N}\left(CDR_{p}^{i}-\overline{CDR_{p}}\right)^{2}}\sqrt{\sum_{i=1}^{N}\left(CDR_{h}^{i}-\overline{CDR_{h}}\right)^{2}}}$$

where CDR_h^i is again the expert CDR and CDR_p^i the predicted CDR of the ith sample.

Screening for Glaucoma
We treat the obtained CDR as a probability number and calculate the receiver operating characteristic (ROC) curves and area under curve (AUC) to evaluate our method on glaucoma screening.
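The first two criteria above can be computed with a few lines of numpy (a sketch; variable names are ours):

```python
import numpy as np

def cdr_metrics(cdr_p, cdr_h):
    """Per-image absolute CDR error, MAE, and Pearson's correlation
    coefficient for predicted (cdr_p) vs expert (cdr_h) CDR values."""
    cdr_p = np.asarray(cdr_p, dtype=float)
    cdr_h = np.asarray(cdr_h, dtype=float)
    delta_e = np.abs(cdr_p - cdr_h)          # absolute CDR error per image
    mae = delta_e.mean()                     # mean absolute error
    dp, dh = cdr_p - cdr_p.mean(), cdr_h - cdr_h.mean()
    pearson = (dp * dh).sum() / np.sqrt((dp ** 2).sum() * (dh ** 2).sum())
    return delta_e, mae, pearson
```

For the screening criterion, the predicted CDR can be treated as a score and passed directly to an ROC/AUC routine such as scikit-learn's `roc_auc_score` together with the expert glaucoma labels.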

Ablation Study
To show the effect of the two attention blocks (SAB and CAB), we conduct ablation experiments on the ORIGA dataset [49]. The results of the different settings are shown in Table 1.
Baseline model. Our baseline model is based on the standard U-net; at the end of the model, we design a simple deep regression network to obtain the CDR number.
Baseline model + CAB. On the basis of the baseline model, the CAB is introduced.
Baseline model + SAB. Similarly, only the SAB is introduced.
Baseline model + CAB + SAB. The full model consists of the U-net regression model, the SAB, and the CAB.
Table 1 shows that both the CAB and the SAB achieve better performance. When the baseline model is combined with the CAB, the MAE/AUC is 0.0728/0.830; with the SAB, it is 0.0739/0.795; when the SAB and CAB are added simultaneously, the proposed model attains the best performance of 0.0698/0.831. Therefore, the combined learning of the CAB and SAB achieves excellent results.
The results of Pearson's correlation coefficient are shown in Figure 5. As shown in Figure 5, with the addition of the two modules, the correlation between the predicted data and the data manually labeled by the doctor gradually increases.

Attention Map Visualization
In the proposed CSAR, the CAB works in the channel dimension, selecting relatively important feature maps and shifting the emphasis to the OD and OC region; meanwhile, the SAB learns the discriminative ability of the feature representation at the pixel level by capturing relationships within a feature map. What is remarkable about the SAB is that it makes the classification of pixels in the boundary area more accurate. Here, we verify these conclusions by comparing visualized feature maps. Figure 6 shows the visualization of the attention maps; we select maps #1-3 from all feature maps in the two types of attention maps.
The experimental results show that the SAB can learn the discriminative ability of the feature representation at the pixel level and capture OD and OC area information. Map #1 responds to the background area, map #2 is relatively sensitive to the rim area, and map #3 selectively captures the OC. For the CAB, some feature maps are selected to check whether it suppresses semantic information with low responsiveness; for example, the response of the vessel semantic information is relatively suppressed after the CAB.

Absolute CDR Error
To verify that the segmentation-auxiliary module is effective, we conduct an ablation study. The results are shown in Table 1.
In Table 1, the mean absolute error of the baseline model is 0.0823; when the baseline model is combined with the segmentation-auxiliary module, the mean absolute error is 0.0722. After the baseline model combines the CAB, the SAB, and the segmentation-auxiliary module (SAM), the final mean absolute error is 0.0671.

Feature Map Visualization
The segmentation-auxiliary module can focus the model's attention on the relevant features instead of the background region. In our experiments, we find the situation of map #1 shown in Figure 7: the non-OD region is more responsive, meaning that the CDR obtained by regression is derived from the background region, which is not what we want. To address this problem, the segmentation-auxiliary module is proposed, since the attention on the OD and OC is easy to focus with its guidance but difficult to learn via the single regression task.
Figure 7. Examples of special circumstances. First row: the CDR obtained by regression is derived from the background region; second row: the CDR obtained by regression is derived from the OD and OC region. Here, "down_reg i" represents the ith down-sampling layer.

Comparison with Existing Methods
We compare the CSAR model with several state-of-the-art models: relevant-vessel bends (R-Bend) [25], the active shape model (ASM) [48], Superpixel [29], Joint U-net + PT [12], M-net [12], and JointRCNN [35]. The results are shown in Table 2. It can be seen that the proposed model obtains the best MAE with a similar AUC. For the absolute CDR error, the deep learning methods outperform the hand-crafted methods. R-Bend [25] copes with the variations of OD regions by utilizing a multidimensional feature space. ASM [48] takes advantage of the circular Hough transform for initialization before segmentation. Neither of these two approaches obtains satisfactory performance. By contrast, Superpixel [29] addresses OD and OC segmentation as a pixel classification task and obtains relatively satisfactory results. M-net [12] and JointRCNN [35] obtain good results. Our proposed CSAR achieves a smaller error than all of the above, demonstrating that the attention blocks and the segmentation-auxiliary module are useful for guiding the CDR calculation.
For the AUC results, the following conclusions are reached: (1) The non-deep-learning method Superpixel [29] surprisingly achieves excellent performance in screening glaucoma. (2) Compared with the traditional methods, the deep learning methods such as M-net [12] and JointRCNN [35] successively improve the AUC. (3) In particular, our experimental results in glaucoma screening are similar to those of the other two deep learning methods: our model reduces the CDR error while keeping a similar AUC. The ROC curves of the different methods are shown in Figure 8. For Pearson's correlation coefficient, compared with R-Bend [25] (0.38), Superpixel [29] obtains relatively satisfactory results (0.59). Joint U-net and M-net also obtain good results (0.617 and 0.671, respectively). Our proposed CSAR achieves better results than all of the above, demonstrating that the joint learning of the attention blocks and the regression network is useful for predicting the CDR number.
In our experiment, we also use the t-test to compare our method with the other methods. The test results (p-value, t) reveal the differences between the proposed method and the existing methods (<0.01, 5.506), such as R-Bend (<0.01, 3.077), Superpixel (<0.01, 4.021), and M-net (<0.01, 4.948).
In our testing, our method costs only 0.06 s to regress the final CDR number for one fundus image on an NVIDIA Tesla GPU. This is faster than most existing methods, such as R-Bend (4 s), ASM (4 s), Superpixel (10 s), and M-net (0.5 s).

Discussion
In addition to the CDR, the ISNT rule is utilized for screening for glaucoma [50]. The ISNT rule orders the rim areas of the inferior, superior, nasal, and temporal regions as:

$$\text{Inferior}\geq\text{Superior}\geq\text{Nasal}\geq\text{Temporal}$$

Samples that follow this rule are considered healthy, while others are suspected to be glaucomatous. These four region markers are shown in Figure 9. In our experiment, we replace the network that regresses the CDR number with a regression network that simultaneously regresses the four ISNT rim areas. Then, samples meeting the above rule are considered normal and the others glaucomatous. We find that the rim error is larger than the ratio error. The AUC calculated with ISNT is 0.587, which is lower than the CDR measurement but higher than the ISNT obtained from the labels manually marked by ophthalmologists (0.540). The main reason presumably lies in the fact that the ISNT measurement depends heavily on position, and the attention of the regression mapping is challenging to make accurate. Likewise, compared with the AUC computed from the experts' CDR values, the AUC computed from the ophthalmologist-labeled ISNT is relatively low, possibly because the rule responds poorly to non-glaucomatous samples with large optic cups.
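The screening decision derived from the ISNT rule above reduces to a chained comparison:

```python
def isnt_healthy(inferior, superior, nasal, temporal):
    """Apply the ISNT rule: rim areas should satisfy I >= S >= N >= T
    for a healthy sample; any violation flags the sample as suspect."""
    return inferior >= superior >= nasal >= temporal
```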
Complexity analysis plays an important role in measuring the efficiency of an algorithm. We calculate the number of parameters and the FLOPs to evaluate our algorithm. In our experiment, compared with the standard U-net, our model increases the number of parameters and the FLOPs to a certain extent. In the future, the self-attention mechanism can be developed further to capture global information, and GPU-memory-friendly designs with high computational efficiency would be a valuable direction for future research.
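The parameter count mentioned above can be obtained in PyTorch with a one-line helper (FLOPs additionally require a profiling tool and are not sketched here):

```python
import torch

def count_trainable_parameters(model: torch.nn.Module) -> int:
    """Sum the element counts of all trainable tensors in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```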

Conclusions
A multitask deep network is proposed to directly regress the CDR number while combining two attention blocks, the SAB and the CAB. To the best of our knowledge, this is the first time a regression network combined with the self-attention mechanism is used for glaucoma screening, and auxiliary learning of segmentation is employed in CDR estimation for the first time. Experimental results show that our CSAR model produces more accurate CDR numbers. We believe that the proposed attention blocks can be easily applied to other tasks, such as image segmentation and image registration, because of their simplicity and lightness; this is what we will test in future studies.