1. Introduction
Economic growth and rising living standards have invariably increased human activity, which has brought negative side effects such as environmental pollution [1]. Although environmental governance has never stopped, environmental pollution is still on the rise. According to the American magazine Science, an estimated 250 million tons of waste will enter the ocean by 2025. Coastal waste accounts for a large proportion of it, and plastics are the main harmful pollutant [2]. The growth of coastal waste not only threatens marine life but also damages the living environment of nearby residents [3]. Therefore, to degrade waste according to its nature and reduce environmental pollution, automatic waste classification and recognition are particularly important in waste disposal.
The rapid development of computer hardware has laid the foundation for the remarkable achievements of deep learning in computer vision applications, including face detection [4,5], medical diagnosis [6], traffic safety monitoring [7,8], and smart agriculture [9,10]. For example, Qin et al. [11] developed a joint training model for face detection and explained how backpropagation can be used to train a cascade of convolutional neural networks. To handle occlusion caused by masks and sunglasses, Wang et al. [12] proposed the face detector FAN (Face Attention Network), which effectively improves the precision of face detection in occluded cases. Shen et al. [13] introduced deep learning models for computer-assisted medical image analysis, which learn features for detection and classification instead of relying on hand-designed features as traditional methods do. At their core, these algorithms mine hierarchical feature representations from data, resulting in enhanced performance in medical analysis applications. Yao et al. [7] proposed a long short-term memory (LSTM) model to predict freeway safety, with naïve Bayes employed for image recognition. The various stages of realizing the algorithm were studied, including data processing, model training, and implementation. Rahnemoonfar et al. [14] presented a simulation-based DCNN that improves on the Inception-ResNet model for fruit yield estimation. Their experiments showed test accuracies of 93% on synthetic images and 91% on real images. As these works show, deep learning has been successfully applied in many fields and has become part of everyday life.
Coastal waste destroys marine ecosystems and is aesthetically unpleasant. Waste classification can be framed as an image classification task. Using deep learning to classify and identify waste is a fascinating research topic in computer vision, and it also points to a way of tackling waste pollution along the coast. Vo et al. [15] proposed a deep neural network named DNN-TC, based on the ResNext model, to improve trash classification performance. In their experiments it achieved accuracies of 94% and 98% on the NV-trash and TrashNet datasets, respectively, outperforming DenseNet121_Aral, RecycleNet, ResNext-101, and ResNet_Ruize on waste classification. Xu et al. [16] used a lightweight model and transfer learning to classify and identify waste by adapting and rebuilding MobileNetV2. The rebuilt network extracts classification features, and an SVM then serves as the classifier to identify six categories of waste, yielding an accuracy of 98.4% on the TrashNet dataset. The paper also notes that the improved model can overcome the problems of limited data and overfitting to achieve high classification accuracy. Awe et al. [17] used Faster R-CNN to identify waste divided into three types: paper, landfill, and recycling. The image dataset was produced by compositing 2–6 images from the TrashNet dataset onto a white background. The authors fine-tuned the model by altering the last layers of the network and achieved 68% mAP. Fulton et al. [18] evaluated four state-of-the-art deep learning models, YOLOv2, Tiny-YOLO, Faster R-CNN, and SSD, on a marine debris dataset that they built specifically for deep visual object detection. However, the performance of these deep neural models is unsatisfactory when the image contains small objects. Better waste classification and detection therefore requires improving the model itself. The main contributions of our research are as follows.
First, we propose an improved deep convolutional neural network, based on Faster R-CNN [19], to extract features and detect objects. Normally, the deeper the network layer, the lower the feature map resolution, which makes small objects harder to detect. To solve this issue and improve the accuracy of waste identification, we incorporate feature maps from a shallower layer, Conv4, which has a higher resolution than the Conv5 layer. Fusing the two makes the backbone features more invariant and equivariant, and more conducive to classification and identification. Second, the anchor mechanism is employed in the Region Proposal Network (RPN). Instead of using the default anchor parameters, we fit the anchor box scales to our dataset so that they match the objects and correct the contribution of objects to the loss function during RPN training, which can improve model performance. Third, because samples are scarce, we use data augmentation in the data pre-processing stage to increase the diversity of the original data and avoid overfitting. Fourth, although many general-purpose image datasets exist, waste is rarely represented in object detection datasets. To our knowledge, there is no publicly available coastal waste database. To support future research, we create the first public dataset in this field, named IST-Waste. Lastly, we verify the performance of the improved model on this dataset and show a meaningful improvement over state-of-the-art methods.
The rest of this study is organized as follows. Section 2 describes related work on waste detection and classification. Section 3 covers the background, i.e., the principle, advantages, and drawbacks of Faster R-CNN. Section 4 is dedicated to the improved model, covering anchor box adjustment, data augmentation, and feature fusion. Section 5 presents the experimental results with comparison and analysis. Finally, Section 6 concludes the paper.
2. Related Work
Object classification and detection are among the most basic tasks in computer vision. Nevertheless, research on waste detection is relatively limited. In our view, the main reason is the scarcity of public waste datasets. Therefore, we collected the IST-Waste dataset of 3000 images, each of them annotated. To promote further research in the area, we make the IST-Waste dataset publicly available. To the best of our knowledge, aside from the TACO dataset [20], which contains 1500 images, IST-Waste is the only public coastal waste dataset, and our work is the first study on the classification and detection of coastal waste. Below, we briefly describe some classic works on waste classification, recognition, and segmentation that are closely related to ours.
To address street litter pollution, Ping et al. [21] developed a deep neural network model to detect and classify various types of street waste, such as leaves and tree branches. The street waste images were collected and processed by a vehicle equipped with cameras and an edge station. Chen et al. [22] proposed an automatic grasping system for garbage classification based on computer vision, in which the RPN and the VGG model are used to classify and grasp objects. Ramalingam et al. [23] used a cascaded machine learning model that combines a CNN with an SVM to detect and classify debris for floor cleaning. The proposed method yields 95.5% accuracy and takes 71 milliseconds for the whole process of classification and recognition, which shows that the approach is suitable for real-time floor-cleaning applications. Jia et al. [24] presented an automatic table inspection and cleaning method that uses a DCNN to detect food litter on the table. It produces high classification confidence scores for each type of waste, such as liquid and solid. The scheme is compared with the Faster R-CNN ResNet and SSD models in the paper, which verifies its validity for the HSR robot. Toğaçar et al. [25] introduced a comprehensive method based on an AutoEncoder network and CNN feature extraction, with an SVM as the classifier, to classify waste. The RR algorithm is used to reduce the number of features and reveal the effective ones. The results show that ResNet-50 delivers the best waste classification performance compared with AlexNet and GoogLeNet on two datasets.
Other works involve a segmentation component. Bai et al. [26] presented a robot with a two-stage CNN that automatically cleans garbage from grass in places such as playgrounds and parks. First, the authors implemented ground waste segmentation based on the SegNet model, without human involvement. Then, the well-known ResNet model was employed for waste classification. To solve the problem of localizing waste in RGB and depth images, Wang et al. [27] developed a novel waste segmentation structure that fuses depth and intensity reasoning and does not require object-level annotations. An improved CRF model produces the final segmentation results from depth-level, appearance-level, and pixel-level information. They also collected the MJU-Waste dataset, which is the first public dataset for waste segmentation.
Each of the papers surveyed above has made certain achievements in waste classification, detection, or segmentation. However, the natural properties of waste itself bring many difficulties to research in the area, such as the variety of waste, its countless shapes, complex and irregular stacking, and even decay and fragmentation. Therefore, researchers sometimes have to strike a balance between model performance and speed.
4. Our Approach
This section describes the proposed method for coastal waste recognition. Feature fusion, RoI Align, correction of the anchor boxes, and data augmentation are employed to achieve a richer semantic representation. The ultimate goal of our approach is to obtain effective and accurate context information to improve the detection performance of Faster R-CNN. The detailed structure of our approach is shown in Figure 2. Overall, the proposed method still consists of the RPN and Fast R-CNN, and we introduce additional components into Faster R-CNN to train the proposals. First, in the data pre-processing stage, we use data augmentation to increase the number of samples and avoid overfitting. Second, once a coastal waste image is input, the VGG16 model extracts image features. The fourth and fifth feature maps are then fused into a feature map that combines higher resolution with high-level semantic information, which is more conducive to the detection of small objects and also serves as the input to the subsequent stages. Next, we optimize the anchor boxes with a clustering algorithm so that they better cover the object sizes found in the coastal waste data. Finally, RoI Align reduces the deviation of the proposals.
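Similar to the original Faster R-CNN, the multi-task loss function in our model contains a location (regression) loss and a classification loss. In the RPN, each generated anchor is labeled only as foreground or background, 1 and 0, respectively, and the classical binary cross-entropy is used to calculate the classification loss. In Fast R-CNN, multi-class cross-entropy is employed to compute the loss. In the regression branch, the loss is counted only for foreground anchors. Following the original Faster R-CNN formulation [19], the total loss of the model can be written as:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)$$

The objective of training is to iteratively minimize this average empirical loss, where $L_{cls}$ is the classification loss on the predicted probability $p_i$ with ground-truth label $p_i^*$, $L_{reg}$ is the regression loss on the predicted box $t_i$ with ground-truth box $t_i^*$, $N_{cls}$ and $N_{reg}$ are normalization terms, and λ is the balance parameter.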
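The multi-task loss described above can be illustrated with a minimal sketch. It assumes a smooth L1 regression penalty (as in the original Faster R-CNN) and normalizes the regression term by the number of foreground anchors, which is a simplification; the function names and the default `lam=1.0` are hypothetical and are not taken from the paper.

```python
import math


def binary_cross_entropy(p, label, eps=1e-12):
    """Cross-entropy for one anchor; p is the predicted foreground probability."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def smooth_l1(x):
    """Smooth L1 penalty on a single coordinate difference."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(probs, labels, pred_boxes, target_boxes, lam=1.0):
    """Mean classification loss plus lam times the mean foreground regression loss."""
    n_cls = len(labels)
    cls = sum(binary_cross_entropy(p, y) for p, y in zip(probs, labels)) / n_cls
    reg = 0.0
    for y, t, t_star in zip(labels, pred_boxes, target_boxes):
        if y == 1:  # regression is counted only for foreground anchors
            reg += sum(smooth_l1(a - b) for a, b in zip(t, t_star))
    n_fg = max(1, sum(labels))  # assumed normalizer: number of foreground anchors
    return cls + lam * reg / n_fg
```

Because background anchors are skipped in the regression term, changing the predicted box of a background anchor leaves the total loss unchanged.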
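We use stochastic gradient descent (SGD), a classic optimization algorithm for deep convolutional networks, to update the model weights. In its standard momentum form, the update is:

$$V_{t+1} = \mu V_t - \alpha \nabla L(W_t), \qquad W_{t+1} = W_t + V_{t+1}$$

where $W_t$ denotes the model weights at iteration $t$, $V_t$ is the previous weight update, α is the learning rate, μ is the momentum weight applied to $V_t$, and ∇ is the partial derivative operator.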
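As a rough illustration of SGD with momentum, the sketch below applies the common update rule (velocity = momentum × velocity − learning rate × gradient, then weights += velocity) to a toy quadratic objective. The function names, the toy objective, and the hyperparameter values are assumptions chosen for the example; they are not the paper's training settings.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v - lr * grad, then w <- w + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def minimize_quadratic(steps=300):
    """Minimize the toy objective f(w) = sum(w_i ** 2), whose gradient is 2 * w_i."""
    w = [3.0, -2.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = [2.0 * wi for wi in w]
        w, v = sgd_momentum_step(w, g, v)
    return w
```

Calling `minimize_quadratic()` drives both weights toward zero, the minimizer of the toy objective.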
4.1. Region Proposal Network (RPN)
The main innovation of Faster R-CNN is the RPN, which is responsible for predicting object bounding boxes with the anchor mechanism, together with a score for each box. In essence, each score indicates whether the proposal region contains an object. Figure 3 shows the framework of the RPN. The shared feature map of Faster R-CNN is used by the RPN and also takes part in RoI pooling. A 3 × 3 convolution is applied to the feature map to obtain an intermediate layer with 256 channels. Every position on the feature map corresponds to an area of the original image and is covered by k anchor boxes. Together, the anchors comprise boxes with different scales and aspect ratios, describing objects of various sizes.
Most previous networks have used specific heuristics to decide anchor values. For example, the standard Faster R-CNN has nine anchor boxes based on hand-picked values, comprising three scales (128 × 128, 256 × 256, 512 × 512) and three aspect ratios (1:1, 1:2, 2:1). In practical applications, however, objects come in all shapes and sizes. Keeping the default anchor box sizes can negatively affect the performance of the trained model, whereas adapting the size and number of anchors to the object sizes in a given dataset can accelerate model convergence and improve detection accuracy. In our work, instead of using the default values of the original Faster R-CNN, we apply k-means clustering to our dataset, inspired by YOLO [31], to determine the anchor box sizes automatically. To balance computational complexity against model accuracy, three basic box sizes are selected for clustering at initialization. Figure 4 shows the width and height distribution of the boxes in our dataset, and Figure 5 shows the k-means clustering result for our samples. First, three samples are selected as the initial cluster centers. Next, the distance from each sample in the dataset to the three cluster centers is calculated, and each sample is assigned to the cluster whose center is nearest. Finally, the center of each cluster is recalculated, and the procedure is repeated until the error is minimized.
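The clustering steps above can be sketched as follows. The distance used here is 1 − IoU between width/height pairs, as popularized by the YOLO family; the text does not state which distance the authors used, so this choice, the deterministic initialization, and the function names are all assumptions made for illustration.

```python
def iou_wh(box, center):
    """IoU of two boxes given as (w, h), both aligned at the same corner."""
    inter = min(box[0], center[0]) * min(box[1], center[1])
    union = box[0] * box[1] + center[0] * center[1] - inter
    return inter / union


def kmeans_anchors(boxes, k=3, max_iter=100):
    """Cluster (w, h) pairs with distance 1 - IoU; returns k centers sorted by area."""
    # Deterministic initialization: pick k boxes spread over the area-sorted list.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    step = len(ordered) / k
    centers = [ordered[int(step * (i + 0.5))] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each box joins the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for b in boxes:
            nearest = min(range(k), key=lambda j: 1.0 - iou_wh(b, centers[j]))
            clusters[nearest].append(b)
        # Update step: each center becomes the mean width/height of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if not members:  # keep the old center if a cluster is empty
                new_centers.append(centers[j])
                continue
            w = sum(m[0] for m in members) / len(members)
            h = sum(m[1] for m in members) / len(members)
            new_centers.append((w, h))
        if new_centers == centers:  # centers stopped moving
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])
```

With k = 3, the returned centers play the role of the three basic anchor sizes; the input is expected to contain at least k boxes.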
As a result, we assign the anchor boxes three aspect ratios {1:2, 2:1, 1:3} and three scales {128 × 145, 196 × 212, 256 × 378}, which account for the different sizes of coastal waste in our dataset. Thus, nine anchor boxes are generated at every position on the feature map, and the RPN produces 36 box regression outputs (4 coordinates × 9 anchors) and 18 classification scores (2 classes × 9 anchors) at each position.
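The anchor arithmetic can be made concrete with a short sketch. The text does not say how the three scales are combined with the three aspect ratios, so this example assumes that each base size keeps its area and is reshaped to each width:height ratio; that rule, the ratio orientation, and the function names are hypothetical.

```python
import math

SCALES = [(128, 145), (196, 212), (256, 378)]  # base sizes quoted in the text
RATIOS = [(1, 2), (2, 1), (1, 3)]              # read as width:height (assumed)


def make_anchors(cx, cy, scales=SCALES, ratios=RATIOS):
    """Return (x1, y1, x2, y2) anchors centered at (cx, cy), one per scale/ratio pair."""
    anchors = []
    for bw, bh in scales:
        area = bw * bh
        for rw, rh in ratios:
            # Assumed rule: keep the base area, reshape to the requested ratio.
            w = math.sqrt(area * rw / rh)
            h = area / w
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def rpn_output_sizes(k):
    """Per feature-map position: 4 box offsets and 2 class scores for each of k anchors."""
    return 4 * k, 2 * k
```

With three scales and three ratios, `make_anchors` yields nine boxes per position, and `rpn_output_sizes(9)` gives the 36 regression and 18 classification outputs mentioned above.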
4.2. RoI Align
In common two-stage detection frameworks (such as Fast R-CNN, Faster R-CNN, and R-FCN), RoI pooling pools the area of the feature map that corresponds to each proposal box into a fixed-size feature map, according to the position coordinates of the box, before the subsequent classification and box regression. Since the position of a proposal box is usually obtained by model regression, its coordinates are floating-point numbers, while the pooled feature map requires a fixed size. RoI pooling therefore quantizes twice: once when the box boundaries are rounded to integer feature map coordinates, and again when the rounded region is divided into cells. After these two quantizations, the proposal box deviates from its original regressed position, which affects the accuracy of detection or segmentation, especially for small objects. To address this issue, we adopt the RoI Align algorithm to obtain feature maps that retain richer information. The quantization is removed, and bilinear interpolation is used to obtain feature values at points whose coordinates are floating-point numbers, so the whole feature aggregation process becomes a continuous operation. RoI Align consists of three steps:
(RoI division): Each candidate region is divided into k × k cells, and the cell boundaries are not quantized.
(Interpolation): The values of all sampling points (s × s points in each cell) are computed by bilinear interpolation.
(Max pooling): The maximum value over the s × s sampling points in each cell is taken as the output of that cell.
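The three steps above can be sketched on a single-channel feature map stored as a list of rows. This is a minimal illustration, assuming the RoI is already expressed in feature-map coordinates and that the sampling points are regularly spaced inside each cell; the function names and the default values `k=2`, `s=2` are hypothetical.

```python
def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map (list of rows) at a float location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, k=2, s=2):
    """Pool roi = (x1, y1, x2, y2) into a k x k output without any rounding."""
    x1, y1, x2, y2 = roi
    cell_w, cell_h = (x2 - x1) / k, (y2 - y1) / k  # float cell size (step 1)
    out = []
    for i in range(k):
        row = []
        for j in range(k):
            samples = []
            for a in range(s):
                for b in range(s):
                    # Regularly spaced sampling points inside the cell (step 2).
                    sy = y1 + (i + (a + 0.5) / s) * cell_h
                    sx = x1 + (j + (b + 0.5) / s) * cell_w
                    samples.append(bilinear(feature, sy, sx))
            row.append(max(samples))  # max over the s x s samples (step 3)
        out.append(row)
    return out
```

On a feature map that is a linear ramp, bilinear interpolation is exact, so the pooled values can be checked by hand, which makes this a convenient sanity test.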
4.3. Data Augmentation
The training dataset has a significant impact on the performance of a detection model, because training data is the only source from which the model learns features. A lack of training data is therefore the first key problem that researchers should tackle. For small objects in particular, research has found that the computed IoU between the predicted anchor boxes and the ground-truth boxes is much lower than expected. A data augmentation strategy can deal with this issue and give the model stronger generalization ability. For coastal waste detection, we use data augmentation, including cropping, rotating, and scaling, to produce auxiliary waste samples. This not only effectively alleviates overfitting but also enriches the features the model learns. The accuracy of small object detection can also be improved by expanding the categories and number of small object samples during training.
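For detection data, each augmentation must be applied to the bounding-box annotations as well as to the image. The sketch below shows only the annotation side for scaling, a 90-degree clockwise rotation, and cropping. The function names, the restriction to 90-degree rotations, and the `min_visible` threshold for discarding heavily cropped boxes are assumptions for illustration, not details given in the paper.

```python
def scale_box(box, factor):
    """Scale a (x1, y1, x2, y2) box together with its image by a common factor."""
    return tuple(v * factor for v in box)


def rotate_box_90(box, img_h):
    """Rotate the image 90 degrees clockwise: a point (x, y) maps to (img_h - y, x)."""
    x1, y1, x2, y2 = box
    return (img_h - y2, x1, img_h - y1, x2)


def crop_box(box, crop, min_visible=0.5):
    """Clip a box to crop = (cx1, cy1, cx2, cy2); return None if too little remains."""
    x1, y1, x2, y2 = box
    cx1, cy1, cx2, cy2 = crop
    nx1, ny1 = max(x1, cx1), max(y1, cy1)
    nx2, ny2 = min(x2, cx2), min(y2, cy2)
    if nx2 <= nx1 or ny2 <= ny1:
        return None  # the box lies entirely outside the crop
    kept = (nx2 - nx1) * (ny2 - ny1) / ((x2 - x1) * (y2 - y1))
    if kept < min_visible:
        return None  # assumed rule: drop boxes that are mostly cut away
    # Express the clipped box relative to the top-left corner of the crop.
    return (nx1 - cx1, ny1 - cy1, nx2 - cx1, ny2 - cy1)
```

Dropping boxes that are mostly cut away avoids training on fragments that no longer resemble the object, which matters most for the small objects discussed above.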
4.4. Feature Fusion Layer
Object instance detection has always been a difficult task in general object detection. The cigarette butts, glass fragments, and bottle caps in our samples are sometimes low-resolution. Faster R-CNN uses VGG16 as its backbone, which produces five feature maps, but only the fifth feature map is passed on to the subsequent stages. It is therefore difficult for the original Faster R-CNN to recognize small objects. The first reason is that a single-layer feature map represents incomplete image information. The second is that Conv5_3 has a large receptive field: it captures a wide range of contextual information but overlooks smaller objects. We therefore fuse the convolutional feature maps Conv4_3 and Conv5_3 to enhance the semantic features. The structure of the multi-scale feature map is shown in Figure 6. Because the feature maps in the model differ in size, we upsample Conv5_3 to match the size of Conv4_3. Then, the L2-normalized outputs of the two layers [32] are concatenated and used as the input to the RPN.
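The fusion step can be sketched on small feature maps stored as nested lists indexed [channel][row][column]. This is a minimal illustration that assumes nearest-neighbour 2× upsampling and L2 normalization of the channel vector at each spatial position; the text specifies neither the upsampling method nor the exact normalization axis, and any learnable scaling after normalization is omitted. The function names are hypothetical.

```python
import math


def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column twice
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row twice
        out.append(rows)
    return out


def l2_normalize(fmap, eps=1e-12):
    """Scale the channel vector at every spatial position to unit L2 norm."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for y in range(h):
        for x in range(w):
            norm = math.sqrt(sum(fmap[ch][y][x] ** 2 for ch in range(c))) + eps
            for ch in range(c):
                out[ch][y][x] = fmap[ch][y][x] / norm
    return out


def fuse(conv4, conv5):
    """Upsample conv5 to conv4's size, L2-normalize both, concatenate along channels."""
    up5 = upsample2x(conv5)
    if len(up5[0]) != len(conv4[0]) or len(up5[0][0]) != len(conv4[0][0]):
        raise ValueError("spatial sizes do not match after upsampling")
    return l2_normalize(conv4) + l2_normalize(up5)
```

Normalizing before concatenation keeps one layer from dominating the other when their activation magnitudes differ, which is the usual motivation for this step.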
6. Conclusions
Object instance detection is always a difficult problem in general object detection. Coastal waste often contains many small objects, such as cigarette butts, scraps of paper, broken glass, and bottle caps. In this paper, we proposed a deep neural network, based on Faster R-CNN, to detect coastal waste. We combined several techniques to improve the performance of the standard Faster R-CNN. The detection of small objects was addressed by fusing high-resolution features with high-level semantic features from the lower-resolution feature map. Moreover, generating anchor boxes according to the object sizes in our dataset helped to improve the performance of the model. In addition, using RoI Align instead of RoI pooling to remove the position offset effectively boosted the performance of automated coastal waste detection, and the data augmentation applied to the model avoided overfitting. Overall, the experimental results showed that the developed deep learning model obtained relatively good accuracy, which meets the requirements of coastal waste detection and reveals great potential for related topics.
We still face some challenges, such as object deformation and decay, data annotation, and model selection. In future work, we will collect more samples so that the model can achieve better detection performance. Improving the speed of the model is also a research direction worth pursuing.