Protein Crystal Instance Segmentation Based on Mask R-CNN

Abstract: Protein crystallization is the bottleneck in macromolecular crystallography, and crystal recognition is an important step in the experiment. To further improve on the recognition accuracy of image classification algorithms, this paper introduces the Mask R-CNN model for the detection of protein crystals. Because protein crystal images are strongly affected by backlight and precipitate, contrast limited adaptive histogram equalization (CLAHE) is applied together with Mask R-CNN. Meanwhile, transfer learning is used to optimize the parameters of Mask R-CNN. Comparison experiments between the combined algorithm and the original algorithm show that the improved algorithm effectively improves segmentation accuracy.


Introduction
Protein crystallography is an important subject in structural biology. The three-dimensional structural characterization of biological macromolecules is essential for understanding their mechanism of action. Crystallography is also widely used in drug discovery, especially in fragment-based drug screening [1][2][3].
At present, 169,436 protein structures have been deposited in the Protein Data Bank (PDB), and more than 88% of them were resolved by X-ray crystal diffraction (http://www1.rcsb.org). To crystallize a protein, in most situations researchers still need to observe the samples through a microscope to determine whether the crystallization process is complete. This observation costs time and labor. Therefore, an automated system for protein crystallography, from protein purification to crystal growth, has become an urgent requirement in the life sciences [4][5][6][7]. Some work has been done on classifying images of protein crystals [8,9]. Because these methods relied on a special imaging system that was relatively rare at the time, they could not achieve good classification results. The image of a protein crystal is seriously affected by the backlight and by the focusing of the microscope. As a result, the quality of a classification algorithm based on traditional machine learning depended entirely on the design of the feature vector, and the classification task could not be realized well. In 2018, Bruno et al. [10] proposed a classification algorithm based on a deep convolutional neural network to classify protein crystallization results, which achieved a classification accuracy of about 94%. However, it only indicates whether there are crystals in the droplet; a classification algorithm cannot tell where the crystals are within the droplet. Meanwhile, some commercial devices have been developed to meet the requirements of academic centers and pharmaceutical companies. The fully automatic crystallization imaging system Rockimager1500 was designed by Formulatrix [11]. However, it was only able to analyze the experimental drop and not the crystals in the drop.
Other optical methods have also been tried. To reduce the influence of optical scattering, second-order nonlinear optical imaging was used to identify protein crystals [7]. The second harmonic signal generated by the interaction of light with the material was used to search for crystals. This also helped in observing smaller crystals, because a second harmonic generation (SHG) signal can frequently be observed from structures that are approximately the same size as, or even smaller than, the lateral resolution of the microscope. However, the method is more suitable for chiral drug crystals, so its scope of application is limited. None of the studies mentioned above fully meets the requirements of researchers. Image segmentation can determine which pixels in an image belong to the object and which belong to the background. Therefore, this paper attempts to use an instance segmentation algorithm to better identify protein crystals in the drop.
Mask R-CNN, which integrates target detection and instance segmentation, was proposed by the Facebook researcher Kaiming He and colleagues [12]. In this paper, the collected protein crystal images are annotated in a format consistent with the Microsoft Common Objects in Context (MS COCO) dataset for network training. The self-built protein crystal dataset was then used to fine-tune the Mask R-CNN network, starting from pre-trained network weights by means of transfer learning.

Network Introduction
As protein crystal images are strongly affected by lighting and lens focus during collection, a pre-processing module was added in front of the Mask R-CNN framework to process the input images and better highlight the features of protein crystals. The improved Mask R-CNN structure is shown in Figure 1. The output of Mask R-CNN is divided into three parts: bounding box regression, classification, and the mask branch. The bounding box regression and the classification belong to the target detection part, while the mask branch belongs to the instance segmentation part.
In the Mask R-CNN structure, the protein crystal image is input into the network, and feature maps at different scales are output through a series of convolution and pooling operations in the feature pyramid network (FPN). These feature maps are then passed to the region proposal network (RPN) to extract regions of interest (ROI). Each ROI is then input to ROI Align, which corrects the pixel alignment on the feature map for subsequent target classification and bounding box regression. In the mask branch, the image is cropped using the corrected bounding box, and mask prediction is performed on the cropped ROI. Each pixel inside the bounding box is therefore treated as a two-class classification problem (0: background, 1: object). This avoids inter-class competition, and the final result is an instance segmentation. The total loss function of Mask R-CNN is defined as
$$L = L_{cls} + L_{box} + L_{mask}$$
where $L_{cls}$ is the classification loss, $L_{box}$ is the bounding box regression loss, and $L_{mask}$ is the segmentation loss of the mask branch.
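The two-class treatment of each pixel in the mask branch can be illustrated with a minimal sketch of a per-pixel binary cross-entropy, averaged over one ROI and then summed with the other two branch losses. This is a simplified illustration in plain Python, not the implementation used in this paper; the function names `binary_cross_entropy`, `mask_loss`, and `total_loss`, as well as the toy probabilities, are assumptions introduced here.

```python
import math

def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy for one pixel: p is the predicted object probability, y is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def mask_loss(pred_mask, true_mask):
    """Average per-pixel binary cross-entropy over one ROI mask (lists of rows)."""
    losses = [
        binary_cross_entropy(p, y)
        for pred_row, true_row in zip(pred_mask, true_mask)
        for p, y in zip(pred_row, true_row)
    ]
    return sum(losses) / len(losses)

def total_loss(l_cls, l_box, l_mask):
    """Sum of the three branch losses: classification, box regression, mask."""
    return l_cls + l_box + l_mask

# Toy 2 x 2 ROI: predicted object probabilities and ground-truth labels (hypothetical values).
pred = [[0.9, 0.2], [0.8, 0.1]]
true = [[1, 0], [1, 0]]
l_mask = mask_loss(pred, true)
print(round(l_mask, 4))
```

Because every pixel is scored independently as object or background, a confident prediction for one pixel does not lower the score of any other class, which is the sense in which inter-class competition is avoided.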

Pre-Processing Module
Image enhancement is needed to increase the contrast of protein crystal images and to highlight the features of the crystals. Histogram equalization (HE) is a common contrast enhancement method in gray space. First, the frequency of each gray level is counted. Then, the cumulative distribution function (CDF) is computed from these frequencies, and the pixels of the original image are mapped to new values through the CDF; this is a nonlinear transformation. Because the CDF is monotonically increasing, areas that are brighter in the original image remain brighter after the transformation. HE is performed over the entire gray image and does not take the local features of the image into account. Although it stretches the gray levels where the histogram is most concentrated and thereby expands the dynamic range, HE may also amplify noise, and for images that contain markedly brighter or darker areas it often fails to achieve a significant enhancement.
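The three steps of global histogram equalization (count gray levels, accumulate the CDF, map pixels through it) can be sketched as follows. This is a minimal illustration in plain Python on a toy image stored as a list of rows; the function name `equalize_histogram` and the sample values are assumptions introduced here, and a practical pipeline would normally use an image library instead.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization of a gray image given as a list of rows."""
    pixels = [v for row in image for v in row]
    # Step 1: count the frequency of each gray level.
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    # Step 2: cumulative distribution function (CDF), normalized to [0, 1].
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running / len(pixels))
    # Step 3: map each gray level through the CDF onto the full output range.
    mapping = [round(c * (levels - 1)) for c in cdf]
    return [[mapping[v] for v in row] for row in image]

# A low-contrast toy image: four nearly identical gray levels (hypothetical values).
low_contrast = [[100, 101], [102, 103]]
print(equalize_histogram(low_contrast))
```

The mapping is monotonic, so the ordering of brightness is preserved while the narrow input range is stretched over the whole gray scale.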
To better handle local features, this paper uses the contrast limited adaptive histogram equalization (CLAHE) algorithm to pre-process the images [13]. CLAHE divides the input image into m × n areas, which are processed separately. First, the gray histogram of each area is calculated. Next, the parts of the histogram above a threshold are clipped and accumulated, and the accumulated excess is distributed evenly over all gray levels. This clipping effectively limits the slope of the cumulative distribution function. The amplification of noise in the neighborhood of a pixel is mainly determined by the slope of the transformation function, which is proportional to the slope of the cumulative distribution function of the neighborhood. Therefore, once the slope of the cumulative distribution function is limited, noise amplification is limited as well. The clipped histogram is then equalized to obtain the pixel mapping. Because each area is mapped independently, pixel values are discontinuous across the borders between areas, and a block effect occurs. Bilinear interpolation is used to remove this block effect, and it also improves computational efficiency. The effect of CLAHE is shown in Figure 2.
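The clipping and redistribution step for a single area can be sketched as below. This is a simplified illustration in plain Python under stated assumptions: the function names `clip_histogram` and `tile_mapping`, the clip limit of 4, and the 8-level toy histogram are all introduced here for illustration, the excess is spread as fractional counts in a single pass, and the bilinear blending between neighboring areas is omitted.

```python
def clip_histogram(hist, clip_limit):
    """Clip histogram bins at clip_limit and spread the excess evenly over all bins."""
    excess = sum(max(count - clip_limit, 0) for count in hist)
    clipped = [min(count, clip_limit) for count in hist]
    share = excess / len(hist)  # fractional counts are kept for simplicity
    return [count + share for count in clipped]

def tile_mapping(hist, clip_limit):
    """Equalize the clipped histogram of one area to get its gray-level mapping."""
    levels = len(hist)
    limited = clip_histogram(hist, clip_limit)
    total = sum(limited)
    mapping, running = [], 0.0
    for count in limited:
        running += count
        mapping.append(round(running / total * (levels - 1)))
    return mapping

# Toy 8-level histogram of one area with a single dominant bin (hypothetical values).
hist = [0, 12, 2, 2, 0, 0, 0, 0]
print(clip_histogram(hist, clip_limit=4))
print(tile_mapping(hist, clip_limit=4))
```

In this single-pass variant a bin can end up slightly above the clip limit after redistribution; the total pixel count is preserved, and the largest step of the resulting mapping is smaller than it would be without clipping.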


FPN Module
FPN is a multi-scale feature fusion network structure proposed by the team of Kaiming He in 2017 [14]. Unlike the traditional image pyramid structure, it is divided into three parts: bottom-up, top-down, and horizontal connection. The structure is shown in Figure 3. ResNet101 was adopted as the feature extraction network to obtain the feature maps $[C_1, C_2, C_3, C_4, C_5]$ in the bottom-up structure. Their scales relative to the original image are $[\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{32}]$. In the top-down structure, up-sampling is performed successively from the $C_5$ layer to the $C_3$ layer. Nearest neighbor up-sampling is used, and its purpose is to double the scale of the upper-layer feature map. Through the horizontal connection, the up-sampled high-level features are fused with the low-level features, which better integrates semantic information with location information and makes more effective use of features at different scales. After fusion, a 3 × 3 convolution kernel is applied to the fused features to eliminate the aliasing effect of up-sampling.
Finally, the feature maps $[P_2, P_3, P_4, P_5, P_6]$ generated by the FPN are sent to the RPN to generate region proposals for target detection. In the RPN, three prior boxes with aspect ratios [2:1, 1:1, 1:2] are generated at each pixel of each feature map. The scale of the prior box increases as the scale of the feature map decreases. At the same time, $[P_2, P_3, P_4, P_5]$ are sent to Fast R-CNN and combined with the region proposals output by the RPN to perform classification of the recognized object and regression of the detection box. The RPN is trained with its own loss function, which combines a classification term and a bounding box regression term.
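The top-down fusion step and the three prior box shapes can be sketched as follows. This is a minimal illustration in plain Python on single-channel feature maps stored as lists of rows; the function names `upsample_nearest_2x`, `merge_top_down`, and `anchor_shapes` are assumptions introduced here, the 1 × 1 lateral convolution and the 3 × 3 smoothing convolution are omitted, and the ratio is interpreted as height divided by width, which is also an assumption.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2D feature map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def merge_top_down(upper, lateral):
    """Add the up-sampled upper-level map to the same-sized lower-level map."""
    up = upsample_nearest_2x(upper)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]

def anchor_shapes(scale, ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs with area scale**2 and height / width equal to each ratio."""
    shapes = []
    for r in ratios:
        w = scale / r ** 0.5
        h = scale * r ** 0.5
        shapes.append((w, h))
    return shapes

# Toy example: a 2 x 2 upper-level map fused with a 4 x 4 lower-level map (hypothetical values).
upper = [[1, 2], [3, 4]]
lateral = [[1] * 4 for _ in range(4)]
for row in merge_top_down(upper, lateral):
    print(row)
print(anchor_shapes(32))
```

Each upper-level value is copied into a 2 × 2 block before the addition, so coarse semantic information is spread over the finer grid that carries the location information.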

ROI Align Module
The ROI Align method was proposed in Mask R-CNN. Because it does not quantize or round the coordinates of the ROI, ROI Align solves the misalignment between the feature map and the original image that occurs in ROI pooling. The structure of ROI Align is shown in Figure 4. The region with dotted lines represents the generated feature map, and the rectangle surrounded by a solid line represents the adjusted ROI. The ROI is divided into 5 × 5 cells. If the number of samples in each cell is 4, each cell is divided evenly into four bins, and the center of each bin is a sampling point. Since the coordinates of the ROI are floating-point numbers, the coordinates of the sampling points are usually floating-point numbers as well. Therefore, bilinear interpolation is adopted to obtain the pixel value at each sampling point, as shown by the arrows in Figure 4. Max pooling is then performed over the four sampling points in each cell, which gives the ROI Align output.
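The sampling procedure for a single cell can be sketched as follows. This is a minimal illustration in plain Python on a single-channel feature map; the function names `bilinear` and `roi_align_cell` are assumptions introduced here, coordinates are assumed to be non-negative and inside the map, and integer coordinates are assumed to coincide with feature map entries.

```python
def bilinear(fmap, y, x):
    """Bilinearly interpolate a 2D feature map at the floating-point point (y, x)."""
    y0, x0 = int(y), int(x)
    y1 = min(y0 + 1, len(fmap) - 1)      # clamp at the lower border
    x1 = min(x0 + 1, len(fmap[0]) - 1)   # clamp at the right border
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align_cell(fmap, y_start, x_start, cell_h, cell_w):
    """Pool one ROI cell: sample the centers of its 2 x 2 bins and take the maximum."""
    samples = []
    for i in range(2):
        for j in range(2):
            y = y_start + (i + 0.5) * cell_h / 2
            x = x_start + (j + 0.5) * cell_w / 2
            samples.append(bilinear(fmap, y, x))
    return max(samples)

# Toy 3 x 3 feature map and one cell of size 2 x 2 starting at the origin (hypothetical values).
fmap = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(bilinear(fmap, 0.5, 0.5))
print(roi_align_cell(fmap, 0.0, 0.0, 2.0, 2.0))
```

No coordinate is rounded at any point, which is the difference from ROI pooling described above.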

Experiment Platform
The software environment of the experiment platform is based on Windows 10, with Keras 2.2.4 and TensorFlow 1.13 as the experiment framework. The CPU is an AMD R5 3600, the memory is 16 GB, and the graphics processing unit (GPU) is an NVIDIA RTX 2060. To use the GPU resources effectively, the original images are resized to 512 × 512 before training. Areas not covered by the image are padded with black borders, and the adjusted images are then input into the network for training.
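The resizing with black borders can be sketched as follows. This is a minimal illustration in plain Python and rests on two assumptions that the text does not state: that the aspect ratio of the original image is preserved, and that the padding is split evenly between opposite sides. The function names `letterbox_geometry` and `pad_with_black` and the 1024 × 768 example size are introduced here for illustration.

```python
def letterbox_geometry(width, height, target=512):
    """Compute the resized size and the black padding for an aspect-preserving resize."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    pad_right = target - new_w - pad_left
    pad_bottom = target - new_h - pad_top
    return (new_w, new_h), (pad_left, pad_top, pad_right, pad_bottom)

def pad_with_black(image, pad_left, pad_top, pad_right, pad_bottom):
    """Surround a gray image (list of rows) with zero-valued borders."""
    width = len(image[0]) + pad_left + pad_right
    rows = [[0] * width for _ in range(pad_top)]
    rows += [[0] * pad_left + list(row) + [0] * pad_right for row in image]
    rows += [[0] * width for _ in range(pad_bottom)]
    return rows

# Hypothetical 1024 x 768 source image.
size, padding = letterbox_geometry(1024, 768)
print(size, padding)
```

The same geometry has to be applied to the annotated masks so that they stay aligned with the padded images.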

Experiment Dataset
The COCO dataset, provided by Microsoft, can be used for image segmentation or target detection. In this paper, the pre-trained weights were obtained by training on this dataset, and the crystal images were downloaded from Machine Recognition of Crystallization Outcomes (MARCO) (https://marco.ccr.buffalo.edu/). Because the MARCO dataset is labeled only for classification, the ground truth of the downloaded crystal images was annotated by colleagues with a background in protein crystallography.
The Labelme software (https://github.com/wkentaro/labelme/tree/v3.11.2, version is 3.16.2) was used to annotate the crystals in the images as masks, and the annotated images were organized into a crystal dataset following the format of the COCO dataset. Corresponding json files, yaml files, and mask files were generated for the crystal dataset. A labeled mask image is shown in Figure 5.


Experiment Results and Analysis
There are two important evaluation indicators for the performance of a classification problem. One is precision, which measures how many of the objects identified as positive are actually correct. The other is recall, which measures how many of all positive samples are predicted correctly. Precision and recall are calculated by Equations (3) and (4), respectively.
$$Precision = \frac{TP}{TP + FP} \quad (3)$$
$$Recall = \frac{TP}{TP + FN} \quad (4)$$
where TP is the number of positive samples predicted as positive, FP is the number of negative samples predicted as positive, and FN is the number of positive samples predicted as negative. For a target detection network, intersection over union (IOU) is a very important concept: it expresses the degree of overlap of two regions. When it is used to test the accuracy of the network prediction, IOU expresses the overlap between the predicted box and the labeled box. The calculation formula is as follows:
$$IOU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})}$$
where $B_p$ is the predicted box and $B_{gt}$ is the labeled box. Firstly, the result of the experiment is evaluated by the mean average precision (mAP) at IOU = 0.50, with 10 images selected randomly from the validation set to calculate the mAP value. Secondly, 100 images are randomly selected from the validation set to calculate mAP (IOU = 0.65). These mAP values are used to verify the precision of the network prediction after adding the CLAHE algorithm. The results are shown in Table 1. They show that adding the pre-processing module improves the precision of the network prediction. The instance segmentation results on the dataset are shown in Figure 6. Even with heavy precipitation, most protein crystals are identified by the network after adding CLAHE, and the segmentation results conform more closely to the shape of the protein crystals.
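The three quantities used in this evaluation can be sketched in a few lines. This is a minimal illustration in plain Python; the function names `precision`, `recall`, and `box_iou`, the (x1, y1, x2, y2) box convention, and the example counts and boxes are assumptions introduced here, and the sketch does not reproduce the full mAP computation.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of actual positives that are found."""
    return tp / (tp + fn) if tp + fn else 0.0

def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Hypothetical counts and boxes, for illustration only.
print(precision(tp=8, fp=2), recall(tp=8, fn=4))
predicted, labeled = (0, 0, 10, 10), (5, 0, 15, 10)
iou = box_iou(predicted, labeled)
print(round(iou, 3), iou >= 0.5)
```

A prediction is counted as a true positive only when its IOU with a labeled box reaches the chosen threshold, so raising the threshold from 0.50 to 0.65 makes the evaluation stricter.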

Conclusions
Mask R-CNN is introduced in this paper and the CLAHE algorithm is tried as an image pre-processing module. The two parts are combined to realize the instance segmen-