Health Monitoring for Balancing Tail Ropes of a Hoisting System Using a Convolutional Neural Network

Abstract: With the arrival of the big data era, it has become possible to apply deep learning to the health monitoring of mine production. In this paper, a convolutional neural network (CNN)-based method is proposed to monitor the health condition of the balancing tail ropes (BTRs) of a hoisting system, in which the features of the BTR images are adaptively extracted by the CNN. This method can automatically detect various BTR faults in real time, including disproportional spacing, twisted rope, broken strand and broken rope faults. Firstly, a CNN structure is proposed, and regularization technology is adopted to prevent overfitting. Then, a method of image dataset description and establishment that can cover the entire feature space of overhanging BTRs is put forward. Finally, the CNN and two traditional data mining algorithms, namely, k-nearest neighbor (KNN) and an artificial neural network with back propagation (ANN-BP), are adopted to train and test on the established dataset, and the influence of hyperparameters on the network diagnostic accuracy is investigated experimentally. The experimental results showed that the CNN could effectively avoid complex steps such as manual feature extraction, that the learning rate and batch-size strongly affected the accuracy and training efficiency, and that the fault diagnosis accuracy of the CNN was 100%, which was higher than that of KNN and ANN-BP. Therefore, the proposed CNN, with its high accuracy, real-time functioning and generalization performance, is suitable for application in the health monitoring of hoisting system BTRs.


Introduction
A mine's hoisting system is the only connection between the underground and the surface and is known as the "throat" of the mine [1,2]. It is a mechatronics-hydraulics-integrated system (comprising a driving friction pulley, hoisting ropes, head sheaves, containers, balancing tail ropes, etc.) [3] and exhibits complex dynamic characteristics, such as inertia, flexibility, and damping, during operation. The tail rope is an important component of the hoisting system: it is set up to balance the weight of the hoisting rope and to obtain equal moments in the mine hoisting system [3]. Hence, the working state and mechanical properties of the tail rope directly affect the safety of mine production [4].

Image Data-Driven Monitoring System Framework
A schematic diagram of the proposed image data-driven monitoring system framework is presented in Figure 1.
The monitoring system framework is composed of three parts, including the image acquisition system, the vertical shaft movable sensor network [28] and the upper computer. The image acquisition system includes a light source, CCD (charge coupled device) cameras, an acquisition card and memory, and can realize the real-time collection of the BTRs image data. The movable sensor network transfers the collected image data to the upper computer. The upper computer is made up of one or more high-performance deep learning workstations, allowing it to achieve the deep mining of big image data features, analyze the data, and give BTR fault warnings. If the tail rope is found to be twisted, broken, or unevenly distributed, the diagnosis information will be sent out immediately so as to avoid the enlargement of the fault. Our work mainly focuses on the study of health monitoring methods. Other aspects of the system, such as the design of the hardware and software of the image acquisition system and the design of the movable sensor network, are not discussed in this paper.
As shown in Figure 1, the proposed image data-driven framework for monitoring BTRs is developed by the following steps: Step 1. Generate the training and testing dataset: collect the BTR image data, clean the data, and divide the processed BTR image data into training and testing datasets [29].
Step 2. Develop the model: based on the dataset, apply data-driven algorithms to develop models for predicting the BTRs' condition [29]. To adjust and optimize the parameter settings of algorithms, the trial-and-error method [20,30] is employed.
Step 3. Model selection: compute the prediction accuracy based on the developed models, and select the most accurate one for monitoring the BTRs' condition.
Step 4. Online monitoring: design the hardware and software of the monitoring system, and apply them to online monitoring.
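The four steps above can be summarized in a short pseudocode-style sketch. The function names and the toy cleaning and selection rules below are illustrative assumptions for exposition only, not part of the system described in this paper.

```python
# Illustrative sketch of the four-step monitoring workflow.
# All names and rules here are hypothetical placeholders.

def generate_dataset(raw_images):
    """Step 1: clean the raw BTR images and split them 70/30."""
    cleaned = [img for img in raw_images if img is not None]  # toy cleaning rule
    split = int(0.7 * len(cleaned))
    return cleaned[:split], cleaned[split:]

def develop_models(train_set, algorithms):
    """Step 2: fit each candidate data-driven algorithm on the training set."""
    return {name: fit(train_set) for name, fit in algorithms.items()}

def select_model(models, test_set, score):
    """Step 3: keep the model with the highest test accuracy."""
    return max(models.items(), key=lambda kv: score(kv[1], test_set))

# Step 4 (online monitoring) would feed live camera frames from the
# image acquisition system to the selected model on the upper computer.
```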


Convolutional Neural Network
A CNN consists of an input layer, a hidden layer, a fully connected layer and an output layer, in which the hidden layer is composed of several alternating convolution layers and pooling layers. The alternating convolution and pooling layers form a sub-convolution-pooling neural network, as shown in Figure 2, and the CNN comprises multiple sub-convolution-pooling neural networks [20]. The feature map of the input layer is convoluted by specific convolution kernels in the convolution layer, a bias is added, and then an output feature is obtained by an activation function; the commonly used activation functions are sigmoid, tanh(x), the rectified linear unit (ReLU), leaky ReLU, etc. The pooling layer performs feature selection on the output feature map of the convolution layer. The fully connected layer and the output layer constitute the classifier, which can be Softmax, a support vector machine (SVM), etc. [31,32].
(1) Convolution
In the convolution layer, the feature map from the upper layer is convoluted by the convolution kernel, and then the output feature map is obtained via the activation function [33]:

$u_j^{\ell} = \sum_{i \in M_j} x_i^{\ell-1} * k_{ij}^{\ell} + b_j^{\ell}, \qquad x_j^{\ell} = f\left(u_j^{\ell}\right)$

where $u_j^{\ell}$ is the net activation of the $j$-th channel of the convolution layer, obtained by summing the convolutions and the bias over the output feature maps $x_i^{\ell-1}$ of the upper layer; $x_j^{\ell}$ is the output of the $j$-th channel of the convolution layer; $f(\cdot)$ is the activation function, a ReLU in this paper; $M_j$ represents the subset of input feature maps used in the computation; $k_{ij}^{\ell}$ is a convolution kernel; and $b_j^{\ell}$ is the bias item added after convoluting. For a given output feature map $x_j^{\ell}$, the convolution kernel $k_{ij}^{\ell}$ corresponding to each input feature map $x_i^{\ell-1}$ may be different.
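The convolution step can be sketched in a few lines. This minimal single-channel example omits the summation over the input-map subset $M_j$ and, as in most deep learning frameworks, computes a cross-correlation rather than a flipped-kernel convolution; it is an illustration, not the paper's implementation.

```python
def relu(v):
    """ReLU activation f(.) used in this paper."""
    return max(0.0, v)

def conv2d_single(x, k, b):
    """'Valid' 2-D convolution of one input map x with kernel k,
    plus bias b, followed by ReLU (single-channel sketch)."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    out = []
    for r in range(H - kh + 1):
        row = []
        for c in range(W - kw + 1):
            u = b  # net activation: bias plus the windowed products
            for i in range(kh):
                for j in range(kw):
                    u += x[r + i][c + j] * k[i][j]
            row.append(relu(u))
        out.append(row)
    return out
```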
(2) Pooling
The output feature map of the down-sampling (pooling) layer is obtained by sampling every input feature map according to the following formula:

$x_j^{\ell} = f\left(\beta_j^{\ell} \, \mathrm{down}\left(x_j^{\ell-1}\right) + b_j^{\ell}\right)$

where $\beta_j^{\ell}$ is the weight coefficient of the down-sampling layer and $b_j^{\ell}$ is the bias of the down-sampling layer. The symbol $\mathrm{down}(\cdot)$ represents the down-sampling function, which calculates the sum, mean or maximum value of the pixels in each n × n region of the input feature map, so that the output map is reduced by a factor of n in each of its two dimensions.
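A minimal sketch of maximum down-sampling, assuming (as in the structural design later in this paper) non-overlapping n × n regions and taking the pooling weight and bias as 1 and 0 for simplicity:

```python
def max_pool(x, n=2):
    """Non-overlapping n x n maximum pooling, i.e. the down(.) function,
    with beta_j = 1 and b_j = 0; halves each dimension for n = 2."""
    H, W = len(x), len(x[0])
    return [[max(x[r + i][c + j] for i in range(n) for j in range(n))
             for c in range(0, W - n + 1, n)]
            for r in range(0, H - n + 1, n)]
```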
(3) Full connection
In the fully connected network, all two-dimensional image features are stitched into a one-dimensional feature vector that serves as the input to the fully connected network. The output of the fully connected layer is obtained by weighting the input and applying the activation function:

$x^{\ell} = f\left(w^{\ell} x^{\ell-1} + b^{\ell}\right)$

where $w^{\ell}$ is the weight coefficient of the fully connected network, and $b^{\ell}$ is the bias item of the fully connected layer.
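The fully connected forward step can be sketched as follows; the ReLU default mirrors the activation choice made later in the structural design, and the weight layout (one row per output neuron) is an assumption for illustration:

```python
def dense(x, w, b, activation=lambda v: max(0.0, v)):
    """Fully connected layer: output_k = f(sum_i w[k][i] * x[i] + b[k]),
    with x a flattened one-dimensional feature vector."""
    return [activation(sum(wk_i * x_i for wk_i, x_i in zip(wk, x)) + bk)
            for wk, bk in zip(w, b)]
```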

(4) Classification
To solve the multi-classification problem, the Softmax [34] function, located in the last layer, is usually used. It outputs the probability $p(y = j \mid x)$, where $x$ is an input sample and $y$ is the corresponding label. The output of a classifier with $n$ classes is therefore an $n$-dimensional vector whose elements sum to 1, as shown by Equation (6) [20,35]:

$p\left(y^{(i)} = j \mid x^{(i)}\right) = \frac{e^{w_j^T x^{(i)}}}{\sum_{l=1}^{n} e^{w_l^T x^{(i)}}} \qquad (6)$

where $w_j$ is the weight vector of class $j$, and the terms $w_j^T x^{(i)}$ are the inputs of the Softmax layer. The factor $1 / \sum_{l=1}^{n} e^{w_l^T x^{(i)}}$ normalizes the distribution so that it sums to 1 [20]. In the training process, an optimization algorithm is used to minimize the loss function and thereby complete the network training. The loss function $J(\theta)$ is defined by Equation (7) [35]:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} 1\left\{y^{(i)} = j\right\} \log \frac{e^{w_j^T x^{(i)}}}{\sum_{l=1}^{n} e^{w_l^T x^{(i)}}} \qquad (7)$

where $1\{y^{(i)} = j\}$ is an indicator function that returns 1 or 0: when the predicted class of the $i$-th input is true for class $j$, the result is 1; otherwise, the result is 0.
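Equations (6) and (7) can be checked numerically with a short sketch. This per-sample illustration works on precomputed logits (the Softmax-layer inputs), not on the batched implementation used during training:

```python
import math

def softmax(logits):
    """Softmax of Equation (6): exponentiate each logit and normalize
    so that the resulting probabilities sum to 1."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Single-sample Softmax loss of Equation (7): -log p(true class)."""
    return -math.log(probs[true_class])
```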

(5) Regularization
The research in [36,37] shows that if a network model performs excellently on the training dataset but has difficulty obtaining a satisfactory accuracy on the testing dataset, the model is overfitting. This phenomenon can be avoided by using regularization technology to restrain the complexity of the model. Commonly used regularization technologies are L2 regularization, L1 regularization and dropout. In this paper, we add an L2 regularization term to the fully connected layer. The L2 regularization term has the form:

$\Omega(\omega) = \frac{\lambda}{2} \left\| \omega \right\|_2^2$

where $\omega$ denotes the network-layer parameters to be regularized, and $\lambda$ controls the size of the regularization item; larger values of $\lambda$ constrain the model complexity to a larger extent.
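A sketch of how the penalty enters the training objective. The 1/2 factor is a common convention and an assumption here, since the paper does not spell out the exact form it uses:

```python
def l2_penalty(weights, lam):
    """L2 regularization term (lam / 2) * sum(w^2); larger lam
    constrains the model complexity more strongly."""
    return 0.5 * lam * sum(w * w for w in weights)

def loss_with_l2(data_loss, weights, lam):
    """Regularized objective: data loss plus the L2 penalty."""
    return data_loss + l2_penalty(weights, lam)
```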

Structural Design
The CNN structure designed for the health monitoring of BTRs is shown in Figure 3, and the configurations of the convolution, pooling, and fully connected layers are listed in Table 1. The input feature map is grayscale with a size of 28 × 28. The hidden layer is composed of two convolution layers and two pooling layers in an alternating arrangement. The numbers of convolution kernels of the first and second convolution layers are 64 and 128, respectively (each with a size of 3 × 3). Before convoluting, with the "same" padding operation, the convolution results at the boundary are preserved so that the output shape is the same as the input shape. The pooling layers use maximum sampling (i.e., taking the maximum value in each 2 × 2 region of the feature map). The fully connected part is set to three layers, with 200, 64 and 32 neurons, respectively. The ReLU function is chosen as the activation function of the convolution layers and fully connected layers. The output layer uses the Softmax classifier. To prevent overfitting, we apply L2 regularization to the fully connected layer F1.
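The feature-map sizes implied by this configuration can be traced with simple arithmetic: "same" padding preserves height and width, and each 2 × 2 maximum pooling halves them. The sketch below is shape bookkeeping only, not the network itself; the 9 output classes correspond to the nine tail rope states of Table 3.

```python
def trace_shapes(size=28):
    """Follow the feature-map size through the layers described above."""
    shapes = [("input", size, size, 1)]
    size_c1 = size              # conv1: 3 x 3, 'same' padding, 64 kernels
    shapes.append(("conv1", size_c1, size_c1, 64))
    size_p1 = size_c1 // 2      # pool1: 2 x 2 maximum sampling
    shapes.append(("pool1", size_p1, size_p1, 64))
    size_c2 = size_p1           # conv2: 3 x 3, 'same' padding, 128 kernels
    shapes.append(("conv2", size_c2, size_c2, 128))
    size_p2 = size_c2 // 2      # pool2: 2 x 2 maximum sampling
    shapes.append(("pool2", size_p2, size_p2, 128))
    flat = size_p2 * size_p2 * 128
    shapes.append(("flatten", flat))
    for name, units in [("F1", 200), ("F2", 64), ("F3", 32), ("softmax", 9)]:
        shapes.append((name, units))
    return shapes
```

For the 28 × 28 input, this yields 7 × 7 × 128 = 6272 features entering the fully connected layer F1.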

Algorithm Flow and Experimental Environment
Before the convolutional neural network is trained and tested, the image data are collected (through the CCD camera), preprocessed (e.g., scaling, graying, etc.), and divided (via the hold-out method). The algorithm flow chart is shown in Figure 4; it involves two parts: the forward propagation of the data and the backward propagation of the error [32]. Firstly, the training parameters of the network are set, the weights and biases of the network are initialized, and then the input feature map, processed by the convolution layers, the pooling layers and the fully connected layers, is transmitted to the output layer. During this process, the output of each layer is the input of the next layer. Then, the error between the actual output and the expected output is transmitted backwards, layer by layer, using the back propagation (BP) algorithm. Next, this error is allocated to each layer, and the weights and biases of the network are adjusted until the convergence condition is satisfied, thus realizing the effective supervised training of the network. The experimental environment is described in Table 2.
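The forward-propagation/error-back-propagation loop of Figure 4 can be illustrated, in drastically simplified form, with a linear model trained by per-sample gradient descent. This toy example mirrors only the loop structure (forward pass, error, weight update until convergence), not the CNN itself:

```python
def sgd_train(samples, lr=0.1, epochs=200):
    """Minimal forward/backward loop: for each sample, run a forward
    pass, compute the output error, and update weight and bias."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            y_hat = w * x + b   # forward propagation
            err = y_hat - y     # error between actual and expected output
            w -= lr * err * x   # back-propagated weight adjustment
            b -= lr * err       # bias adjustment
    return w, b
```

On data generated by y = 2x, the loop converges to w close to 2 and b close to 0.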

Dataset Description and Establishment
The establishment of the dataset is complex, and the richness and accuracy of the dataset have a direct influence on the recognition ability and generalization performance of the network. In this section, we first describe the data (i.e., the tail rope failure categories, forming reasons, and expression forms). Then, based on the data description, we establish a dataset that covers all the features.

Data Description
In the hoisting system, the states of the BTRs basically include normal, disproportional spacing, twisted rope, defect, and broken rope. Disproportional spacing is caused by unstable factors in the hoisting system, such as mechanical vibration, wind-induced vibration, etc., and is the precondition of a twisted rope. The twisted rope fault occurs when the hoisting system is very unstable. In this paper, collision contact between the two ropes is also classified as a twisted rope-type fault. The twisted rope-type fault is a serious fault that causes instability in the hoisting system and can produce a broken rope or downtime, so it should be avoided. Defects include wear, broken wire, broken strand, and rust; among these, a broken strand, the precondition of a broken rope, is the most serious defect. Because a broken rope directly leads to the instability of the hoisting system or even accidents, it must be avoided.
The measure of setting separate woods (using wood to separate each tail rope) has been adopted in order to prevent the collision of the BTRs, but the separate woods tend to damage the BTRs by scratching or pulling, which aggravates the wear and failure of the BTRs. The tail rope is in a state of free overhanging in the shaft, is subjected to random vibrations and external excitation, and its attitude is difficult to estimate. Therefore, according to the actual production situation, we use the empirical method to build up the BTRs' state dataset with the whole feature space as far as possible. The image dataset in this paper is made up of five typical feature states, namely, normal (a), disproportional spacing (b), twisted rope (c), broken strand (d) and broken rope (e), as shown in Figure 5.
It should be noted that the distance of normal (a) here is defined as being greater than 3/4 of the normal distance (the distance between the tail ropes in the stationary state). Disproportional spacing (b) is defined as a distance less than 1/2 of the normal distance between the two ropes. Twisted rope (c) covers the variety of forms in which the two ropes become entangled. Broken strand (d) is divided into three categories: broken strand of the left rope (d1), of the right rope (d2), and of double ropes (d3). Broken rope (e) is likewise classified into three categories: left broken rope (e1), right broken rope (e2), and double broken ropes (e3). As shown in Figure 5, assuming that the normal distance between the tail ropes is D, that the view of the image taken by the camera is L long and W wide, and that d denotes the measured rope spacing, the spacing definitions above can be written as

$d_{(a)} > \frac{3}{4}D, \qquad d_{(b)} < \frac{1}{2}D$

Therefore, the dataset has nine characteristics (i.e., a, b, c, d1, d2, d3, e1, e2, e3). When different faults are diagnosed, an early warning is issued according to the fault level (Level 1: the normal state triggers no warning; Level 2: when the spacing is disproportional, a reminder to decelerate is given but no warning is issued; Level 3: an overhaul warning is given when there is a broken strand; Level 4: a brake signal is sent out immediately when there is a twisted or broken rope). It is important to note that, in order to distinguish the two characteristic states of normal and disproportional spacing, we define these spacings as being greater than 3/4 and less than 1/2 of the normal distance, respectively; a spacing between 1/2 and 3/4 of the normal distance needs to be observed, because the identification result may be either normal or disproportional spacing.
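The mapping from the nine diagnosed states to the four warning levels can be written down directly. The dictionary below is an illustrative encoding of the levels listed above, not code from the paper:

```python
# Hypothetical encoding of the four warning levels for the nine
# state labels a, b, c, d1-d3, e1-e3 described in the text.
WARNING_LEVEL = {
    "a": 1,                     # normal: no warning
    "b": 2,                     # disproportional spacing: decelerate, no warning
    "d1": 3, "d2": 3, "d3": 3,  # broken strand: overhaul warning
    "c": 4,                     # twisted rope: brake signal immediately
    "e1": 4, "e2": 4, "e3": 4,  # broken rope: brake signal immediately
}

def warn(state):
    """Return the warning level for a diagnosed BTR state."""
    return WARNING_LEVEL[state]
```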
Identification results of normal or disproportional spacing do not affect the fault diagnosis results, because no action needs to be taken in either case (Levels 1-2 are the healthy state, which triggers no warnings; Level 3 is a mild malfunction; and Level 4 is a serious fault state). The above method can also be used to describe the data of a hoisting system containing more than two tail ropes.

Dataset Establishment
Because of the difficulty associated with collecting samples covering the whole feature space in the field, and of estimating all possible poses with theoretical formulae, in this paper we set up an experimental image dataset containing the nine features based on production experience and use techniques that generate more examples by deforming the existing ones [32]. The process of setting up the dataset is as follows: first, typical images of the nine features are created; then, ten seed images of each type are set up, with each seed image of the same type being different, as depicted in Figure 6 (using the same smooth blue background plate without texture). The images are then expanded to a scale of 4500 by zooming, translation, rotation, and other means to enhance the generalization ability of the network model [38]. The data extension method [39] is as follows: Step 1: The seed images are rotated from −5 degrees to 4 degrees with an increment of 1 degree; Step 2: The images obtained in Step 1 are scaled by a factor ranging from 0.8 to 1.2 with an increment of 0.1; Step 3: All images are uniformly scaled to 28 × 28 by the bilinear interpolation method; Step 4: All images are grayed and converted into line vectors; Step 5: The labels are added and the dataset is established.
In the data expansion process, the rotation is designed to simulate the inaccuracy of the camera installation angle in actual shooting, or the swing of the tail rope in the field of vision. Scaling is used to simulate different image sizes. The bilinear interpolation method scales the images to a uniform size to facilitate the standardization of the data (the image size in this paper is 28 × 28; other common sizes are 32 × 32, 64 × 64, etc.). Gray processing removes the influence of color and illumination, so that the input data contain only the position and defect feature information of the tail ropes. After converting the grayscale images into vectors and adding labels, data mining can begin, using the constructed algorithm model.
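The size of the expanded dataset follows directly from the extension steps: 10 seed images per feature, 10 rotation angles and 5 scale factors give 500 images per feature, and 9 features give the stated scale of 4500. A sketch of this bookkeeping:

```python
def expansion_count(seeds_per_class=10, classes=9):
    """Count the images produced by the two-step data extension:
    rotations of -5..4 degrees (step 1) and scales 0.8..1.2 (step 0.1)."""
    rotations = len(range(-5, 5))  # -5, -4, ..., 4 -> 10 angles
    scales = 5                      # 0.8, 0.9, 1.0, 1.1, 1.2
    per_seed = rotations * scales   # 50 variants per seed image
    return seeds_per_class * per_seed * classes
```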
It is known that the images collected by CCD cameras under actual working conditions are of two wire ropes in different states, with the position of the wire ropes and the state of the broken strand on the ropes being the main image characteristics. The recognition results should not be influenced by the image background, oil pollution on the wire rope surface, obvious light changes, and so on.
After image preprocessing, the experimental dataset is essentially consistent with the actual scene dataset: each sample is a 28 × 28 gray pixel matrix that directly reflects the position of the wire ropes and the shape of any broken strand. In order to further illustrate the position and defect feature information of the tail rope after scaling and grayscale processing, we display the bilinear-interpolation-scaled images and grayscale images in Figure 7. We randomly selected some images from Figure 6 (e.g., the eighth image of the twisted rope (c-8) and the first image of the broken strand of the left rope (d1-1)) and applied the following image processing: first, the bilinear interpolation method was used to scale the size to 28 × 28; then, graying was performed. The position and defect features of the tail ropes remain clearly visible in the scaled and grayed images. Because the CNN is not sensitive to the scale and rotation of the input image data, it can automatically mine and learn the potential feature information of the dataset.
The nine kinds of tail rope states are given in Table 3.

Experiment and Analysis
This section describes our experiment and presents the analysis of the obtained results. Firstly, we propose the evaluation methodology and metrics for the performance measure. Secondly, we describe the data mining of the tail rope dataset using the CNN. Then, we provide a comparison with other traditional intelligent methods (e.g., KNN and ANN-BP) that we used to carry out the BTR fault diagnosis. Finally, the results of each algorithm are compared and analyzed. During the study of the different algorithms, the related parameters are adjusted to achieve better accuracy, the hold-out method is used to verify its generalization performance, and the diagnosis results are analyzed using the confusion matrix.

Evaluation Methodology and Performance Measure
In general, in the actual task, we need to evaluate the generalization error of the model, and then choose the model with the smallest generalization error. Therefore, it is necessary to use the testing set to test the discriminant ability of the model, and then take the test error of the testing set as an approximation of the generalization error. The testing set and training set are usually mutually exclusive, (i.e., the test samples do not appear in the training set and are not used in the training process). Therefore, this paper uses the hold-out method [40] to evaluate the model. The hold-out method directly divides the data set D into two mutually exclusive sets, namely the training set A and the testing set B (D = A∪B, A∩B = Ø) [41]. After the model is trained using training set A, testing set B is used to evaluate the test error as an estimate of the generalization error.
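The hold-out split can be sketched as follows; with the 70/30 ratio used later in this paper, a 4500-sample dataset yields 3150 training and 1350 testing samples, and the two sets are mutually exclusive by construction. This is an illustrative sketch, not the paper's implementation.

```python
import random

def hold_out_split(dataset, train_fraction=0.7, seed=0):
    """Hold-out method: D = A u B with A and B mutually exclusive
    (A n B is empty); A is used for training, B for testing."""
    idx = list(range(len(dataset)))
    random.Random(seed).shuffle(idx)     # random, reproducible partition
    cut = round(train_fraction * len(dataset))
    train = [dataset[i] for i in idx[:cut]]
    test = [dataset[i] for i in idx[cut:]]
    return train, test
```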
After evaluating the generalization performance of the model, it is necessary to measure the performance of the model with evaluation metrics. In this paper, four evaluation metrics are calculated, namely accuracy, precision, recall and f1-score, given by Equations (9)-(12) [22]:

$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (9)$

$\mathrm{precision} = \frac{TP}{TP + FP} \qquad (10)$

$\mathrm{recall} = \frac{TP}{TP + FN} \qquad (11)$

$\mathrm{f1\text{-}score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (12)$

where TP means true positive, FP means false positive, TN represents true negative, and FN represents false negative. All of them are determined by the combination of the real category and the model prediction category [42]. Taking binary classification as an example, the confusion matrix of the classification results is shown in Table 4. It is clear that the total number of samples equals TP + FP + TN + FN.

Table 4. Confusion matrix of the binary classification results.

                     Predicted Positive    Predicted Negative
Actual Positive      TP                    FN
Actual Negative      FP                    TN
Different metrics directly reflect different aspects of the health monitoring task. For example, accuracy directly reflects the proportions of correct and erroneous predictions over all of the test samples. Precision reflects, among the samples predicted to belong to a certain category, how many are predicted correctly and how many incorrectly. For example, if 100 prediction results are twisted rope faults, of which 90 test samples are actually twisted rope faults and 10 are other faults, then the precision for the twisted rope fault is 90%. Recall and precision are a pair of contradictory measurements. Recall reflects, among the actual samples of a certain class, how many are predicted correctly. For example, if a testing set contains 100 samples of twisted rope, of which 90 are predicted to be twisted rope faults and 10 are classified as other faults, the recall for the twisted rope fault is 90%. A good classifier maximizes both precision and recall so as to make fewer incorrect predictions, which is expressed by the f1-score, the harmonic average of precision and recall.
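Equations (9)-(12) can be computed directly from the four confusion-matrix counts; the example counts in the test reflect the twisted-rope illustration above (90 correct out of 100), with hypothetical TN and FN values added to complete the matrix:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and f1-score from the confusion
    matrix counts, following Equations (9)-(12)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```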
The tail rope health monitoring in this paper is a multi-classification task. According to Section 4, the main feature categories, namely normal (a), disproportional spacing (b), twisted rope (c), broken strand (d), and broken rope (e), should not normally be predicted incorrectly, because their features are quite different. It may, however, be difficult for classifiers to distinguish similar subcategories, for example, the subcategories of broken strand (d): broken strand of the left rope (d1), of the right rope (d2), and of double ropes (d3). If the defects on the left or right rope were to change in size, shape, or height, broken strand of double ropes (d3) might be predicted as broken strand of the left rope (d1) or of the right rope (d2), or vice versa. Similarly, for left broken rope (e1), right broken rope (e2), and double broken ropes (e3) in the category broken rope (e), when the position or height of the break changes, it is easy to predict double broken ropes (e3) as left broken rope (e1) or right broken rope (e2), or vice versa.
In the following, the performance of the classifiers is measured with the metrics given by Equations (9)-(12), and the prediction results are visualized by the confusion matrix.

The Convolutional Neural Network
(1) CNN parameter selection
Concerning the CNN configuration, it is still an open question which hyper-parameters (e.g., number of layers, learning rate, size of the filters, batch-size, etc.) are more or less useful for this task [42]. The hyper-parameters are adjusted in order to study the performance of the built CNN. Because the choices of learning rate and batch-size severely affect the training and testing results, we adjust and study these two hyper-parameters in this paper [20]. The structure of the CNN is shown in Figure 3, and the configurations of each layer are listed in Table 1. In addition, before each round of training, the dataset is randomly shuffled, the network parameters are randomly initialized, the L2 regularization term is added to the fully connected layer F1, and a stochastic gradient descent (SGD) algorithm is used to train the network [35]. Before the training and testing on the BTR dataset, 70% of the total sample is selected randomly as the training set, and the remaining 30% is used as the testing set (i.e., 3150 samples form the training dataset and 1350 samples form the testing dataset). After training and testing, we mainly use Equation (9) (accuracy) to evaluate the performance of the CNN.
(a) Learning rate

An ideal learning rate will accelerate the convergence of the model, while an undesirable learning rate can even cause the loss of the objective function to explode, so that training fails to complete [43]. In this section, the network iteration is set to 40 epochs, and the initial batch-size is set to 5. The training loss, training accuracy, testing loss, and testing accuracy under different learning rates are shown in Table 5.
From the data in Table 5, the training and testing curves are plotted in Figure 8. Table 5 and Figure 8 show that both the training accuracy and testing accuracy are 100% around a learning rate of 0.01, where the accuracy is highest and stable. During training and testing, the testing accuracy and loss are basically consistent with the training accuracy and loss, indicating that there is no significant noise in the dataset and that the network performance is good. As the learning rate increases, the training and testing accuracy first increase, then remain stable, and finally drop quickly (the training and testing loss decrease at first, then remain stable, and finally increase and level off), indicating that both smaller and larger learning rates reduce the accuracy of the network. Therefore, in this experiment, the optimized learning rate is set to 0.01.
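The qualitative behavior described above can be illustrated with a minimal sketch (deliberately not the paper's network): gradient descent on the one-dimensional quadratic f(w) = w² shows why an over-small learning rate converges slowly while an over-large one makes the loss explode.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w^2 (gradient 2w) with a fixed learning rate;
    return the distance from the optimum after the given steps."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return abs(w)

small = gradient_descent(0.01)   # converges, but slowly
good = gradient_descent(0.4)     # converges quickly
large = gradient_descent(1.1)    # diverges: the loss "explodes"
```

Here the multiplicative factor per step is (1 - 2·lr), so any lr above 1 makes the iterate grow without bound, mirroring the failure mode noted for large learning rates.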

(b) Batch-size
When the SGD method is adopted, the batch-size has a great influence on network performance. In this section, we set the network iteration to 40 epochs, and the learning rate to 0.01. The training loss, training accuracy, testing loss, testing accuracy, and time cost of different batch-sizes are shown in Table 6.
The training and testing curves are plotted in Figure 9, according to the data in Table 6. From Table 6 and Figure 9, it can be seen that when the batch-size is 1, 3, or 5, the training accuracy and testing accuracy are both 100%, the highest and most stable values. As the batch-size increases, the training and testing accuracy remain stable at first and then drop quickly (the training and testing loss remain stable at first, then increase fast), indicating that larger batch-sizes reduce the accuracy of the network. It is also found that larger batch-sizes lead to less time being consumed per iteration. If graphics processing units (GPUs) are used to accelerate the computation through parallelism, the iteration time can be reduced significantly. Therefore, in this experiment, when the learning rate of the CNN model is set to 0.01 and the batch-size is set to 5, the training and testing accuracies are high and the time consumption of each iteration is low, meeting the requirements of accuracy and real-time operation.
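One reason larger batch-sizes consume less time per iteration is that they reduce the number of SGD parameter updates per epoch. A small sketch of this arithmetic, assuming one update per mini-batch:

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of SGD parameter updates performed in one epoch."""
    return math.ceil(n_samples / batch_size)

# 3150 training samples, as in the experiment above
for bs in (1, 3, 5, 50, 150):
    print(bs, steps_per_epoch(3150, bs))
```

With a batch-size of 5, one epoch needs 630 updates instead of the 3150 required for a batch-size of 1, which is consistent with the observed drop in per-iteration time.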

(2) Detailed Results

(a) Hold-out method

The hold-out method [40] is used to evaluate the generalization error of the model. First, the 4500 samples are randomly disturbed, and then a certain proportion of them is chosen for training and the rest for testing. Each training run lasts for 40 epochs, and the evaluation metrics on the test dataset are calculated according to Equations (9)-(12). After training and testing five times and calculating the mean values, the results are shown in Table 7.

Table 7. The different dividing ways of the hold-out method.

According to Table 7, we find that the way the dataset is divided between training and testing has little effect on the experimental results: the four metrics under each division are all 1, with only small differences in loss and time consumption. These results demonstrate that the established CNN network model has a good performance.
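The repeated hold-out evaluation can be sketched as follows (a toy label set and a stand-in perfect classifier, purely for illustration; the real accuracies come from the trained CNN):

```python
import random

def holdout_accuracy(labels, predictions, test_fraction, seed):
    """Evaluate accuracy on one random hold-out split of the data."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)                              # randomly disturb the samples
    n_test = round(len(idx) * test_fraction)
    test = idx[:n_test]
    hits = sum(labels[i] == predictions[i] for i in test)
    return hits / n_test

# 4500 toy samples over 9 classes, with a perfect classifier as stand-in
y = [i % 9 for i in range(4500)]
p = list(y)
runs = [holdout_accuracy(y, p, 0.25, seed=s) for s in range(5)]
mean_acc = sum(runs) / len(runs)                  # mean over five repetitions
```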
In similar applications of machine learning and CNNs for image classification, approximately 2/3 to 4/5 of the samples are generally used for training and the rest are used for testing [44]. Therefore, we use 75% of the data for training and 25% of the data for testing in the next part.
(b) Iterative process

Through all of the above studies, we adopt the CNN structure proposed in Section 3, combined with Table 1, to determine the following network settings:

• The dataset is randomly divided using the hold-out method, with 75% used as the training set and 25% as the testing set;
• The learning rate is set to 0.01, the batch size is set to 5, and the iteration is set to 40 epochs;
• The fully connected layer F1 is processed using L2 regularization;
• The network is trained using an SGD algorithm.

The iterative process of training and testing for 40 epochs is shown in Figure 10. From Figure 10, it can be seen that during the 40-epoch iterative process, the training accuracy and testing accuracy increase rapidly and approach 100%, reaching 90% in approximately 10 rounds and 99% in around 17 rounds. The training loss and testing loss converge quickly and eventually approach 0.0002. Throughout the iterative process, the training accuracy is consistent with the testing accuracy, as are the training loss and testing loss. The testing results are as good as the training results, which shows that, owing to the regularization processing of the network, there is no overfitting phenomenon and the generalization performance is good. After 20 rounds, the training and testing curves are relatively smooth, indicating that 40 rounds of iteration are not needed to achieve a good effect. Meanwhile, the time consumption of each iteration is low (32 s/epoch, 10 ms/step).

(c) Confusion matrix
A confusion matrix is used to present the performance and results of the CNN, as shown in Table 8 and Figure 11. The accuracy, precision, recall, and F1-score are all 1 for the 1125 prediction samples, and the prediction results of each category exactly match the actual labels, indicating the good performance of the CNN algorithm. The CNN has a good prediction ability for the tail rope faults: it can completely separate the nine kinds of tail rope states and predict them accurately. Therefore, the convolutional neural network proposed in this paper for the health monitoring and fault diagnosis of hoisting system BTRs presented a good performance, meeting the requirements of accuracy, real-time functioning, and generalization performance.
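The per-class metrics read off the confusion matrix can be computed as in the following sketch, using the standard definitions of precision, recall, and F1 (assumed here to match Equations (9)-(12), which are not reproduced in this section):

```python
def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a confusion matrix,
    where cm[i][j] = number of samples of true class i predicted as class j."""
    n = len(cm)
    out = []
    for k in range(n):
        tp = cm[k][k]                               # true positives for class k
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, actually not k
        fn = sum(cm[k]) - tp                        # actually k, predicted not k
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out.append((prec, rec, f1))
    return out

# a perfectly diagonal 3-class example: every metric equals 1
cm = [[10, 0, 0], [0, 12, 0], [0, 0, 8]]
metrics = per_class_metrics(cm)
```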

The k-Nearest Neighbor and Artificial Neural Network with Back Propagation
(1) KNN

The KNN [45] is a classification method based on statistics, first proposed by Cover and Hart in 1968. As one of the simplest machine learning methods, the algorithm is theoretically mature and widely used in classification tasks [46]. This algorithm performs the following operations on each sample of unknown category in the dataset:

Step 1. The distance between each point of the dataset with a known class and the current point is calculated;
Step 2. The distances are sorted in increasing order;
Step 3. The k points with the minimum distance from the current point are selected;
Step 4. The occurrence frequency of each category among the previous k points is determined;
Step 5. The class with the highest frequency among the previous k points is selected as the predicted class of the current point.

In Step 1, candidate distance measures include the Euclidean distance, the Manhattan distance, etc. (the latter is utilized in this paper). The feature space χ is an n-dimensional real vector space R^n, where the Manhattan distance between x_i and x_j is d(x_i, x_j) = Σ_{l=1}^{n} |x_i^(l) − x_j^(l)|. In practical applications, the chosen k value should be neither too small nor too large, because the prediction results are very sensitive to k [47]. For example, for k values of 7, 10, 13, 15, and 20, the accuracy results are 85.24%, 88.44%, 86.67%, 85.42%, and 81.33%, respectively, which illustrates that the prediction accuracy varies considerably with k. To find the k nearest neighbor points quickly, we use the ball-tree [48]. The ball-tree is suitable for high-dimensional problems, generally when the feature dimension is greater than 20 [49]; in this paper, the dimension of the dataset is 784. After adopting the ball-tree in KNN, the prediction accuracy is 94.04% and the time consumption is 50 s. The confusion matrix of the prediction result is shown in Figure 12.
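Steps 1-5 with the Manhattan distance can be sketched as a minimal KNN classifier (a toy 2-D example, not the 784-dimensional BTR features, and without the ball-tree speed-up):

```python
from collections import Counter

def manhattan(a, b):
    """Manhattan (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_x, train_y, query, k):
    """Steps 1-5: compute distances, sort them, take the k nearest points,
    count their class frequencies, and return the majority class."""
    dists = sorted((manhattan(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# tiny hypothetical training set with two well-separated classes
xs = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
ys = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(xs, ys, (1, 1), k=3))   # "A"
```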
Appl. Sci. 2018, 8, 1346

As depicted in Figure 12, the precision of the normal (NM) and disproportional spacing (DS) states is 1, which is the highest. The precision for the broken strand of the left rope (BS-LR) is 0.83, which is the lowest. The prediction results show that the main prediction errors occur among similar fault types. For example, of the 127 BS-LR faults, 17 are predicted as the broken strand of the right rope (BS-RR) type; of the 131 BS-RR faults, 10 are predicted as the BS-LR type; and of the 124 double broken rope (D-BR) faults, 9 are predicted as the right broken rope (R-BR) type. The results demonstrate that KNN has some shortcomings in distinguishing similar fault types (consistent with the hypothesis analysis in Section 5.1), and its accuracy is lower than that of the CNN algorithm.
(2) ANN-BP The ANN-BP is a typical model that uses an error back-propagation algorithm to train the weights and biases of each neuron, and it contains several layers (i.e., input layer, output layer, and hidden layers) [50]. The ANN-BP has a relatively simple structure, and thus it has been widely used in fitting nonlinear continuous functions and pattern recognition [51].
The training and testing processes are shown in Figure 13, using the same structure as the CNN's fully connected layers (784-200-64-32-9) and the same network settings proposed in this paper (i.e., using the hold-out method; the learning rate set to 0.01; the batch size set to 5; the iteration set to 40 epochs; the SGD algorithm; etc.). According to Figure 13, the testing accuracy is 96.44%, lower than the diagnostic accuracy of the CNN, showing the importance of the convolutional operation of the CNN in feature extraction. Compared with Figure 10, it can be seen that the iterative process of the designed CNN model is more stable than that of the ANN-BP. To study the influence of the number of hidden layers and the number of nodes per layer on the performance of the ANN-BP, we fine-tune the structure of the network and study its prediction performance. The three hidden layers of the ANN-BP are denoted as HL1, HL2, and HL3, respectively. Firstly, the number of hidden layers is changed (using HL1, HL2, HL3, HL1HL2, HL2HL3, and HL1HL3), and the prediction accuracy results are shown in Figure 14. Then, the number of nodes in each layer is changed (i.e., HL1 is varied from 180 to 220, HL2 from 44 to 84, and HL3 from 12 to 52), and the prediction accuracy results are shown in Figure 15.
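For reference, the trainable parameter count of the fully connected 784-200-64-32-9 structure can be computed with the following sketch (assuming standard dense layers, i.e., one weight matrix plus one bias per neuron, which the paper does not state explicitly):

```python
def dense_param_count(layer_sizes):
    """Total trainable weights and biases of a fully connected network,
    summing (n_in * n_out + n_out) over consecutive layer pairs."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# the 784-200-64-32-9 structure used for the ANN-BP comparison
total = dense_param_count([784, 200, 64, 32, 9])
print(total)   # 172241
```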
Through analysis, we find that the ANN-BP is sensitive to the number of network layers and the number of nodes in each layer, and its prediction accuracy does not reach 100%, meaning that the prediction accuracy of the ANN-BP is less than that of the CNN proposed in this paper. To visualize the prediction results, we display the confusion matrix of the ANN-BP in Figure 16. According to Figure 16, the precision of NM, DS, twisted rope (TR), BS-LR, BS-DR, and D-BR is 1, which is the highest. The precision for the left broken rope (L-BR) is 0.77, which is the lowest. The prediction results show that the main prediction errors occur among similar fault types. For example, of the 134 L-BR faults, 19 are predicted as the R-BR type and 12 are predicted as the D-BR type; of the 121 R-BR faults, 6 are predicted as the D-BR type. The results show that the ANN-BP and KNN have some deficiencies in distinguishing similar fault types, which is consistent with the hypothesis analysis in Section 5.1.

Comparative Analysis of Results
The results of the different algorithms evaluated in this paper are listed in Table 9. In summary, through the training and testing of the BTR dataset, the CNN model achieved a diagnostic accuracy of 100% (it could accurately identify and predict all tail rope statuses), which was higher than the 94.04% of KNN and the 96.44% of ANN-BP. The time consumption of each iteration was 32 s, with each step taking 10 ms, which meets the requirements of system accuracy and real-time functioning. Additionally, the L2 regularization of the fully connected layer F1 prevented overfitting, which allowed the network to achieve a good generalization performance. Although the ANN-BP consumed less time, its accuracy and stability were worse than those of the CNN; KNN, in turn, was worse than the CNN in terms of both accuracy and time consumption. We can therefore conclude that the CNN outperformed KNN and ANN-BP for the health monitoring of tail ropes, and that the CNN model is more suitable for the actual health monitoring of hoisting systems.

Industrial Application Plan
This paper describes a method for the health monitoring and fault diagnosis of balancing tail ropes. The object of this research was a hoisting system with two balancing tail ropes, but the same approach can be used to construct a dataset for hoisting systems with three or more tail ropes. The industrial application plan is as follows: first, configure the related hardware and software shown in Figure 1 and apply explosion protection to the related devices; after the system is debugged, collect a large number of tail rope images at the scene and set up an image dataset of the actual working conditions; then, use these data as the input to train the CNN or to fine-tune the trained CNN. Deep learning can also be introduced into the safety monitoring of the whole hoisting system in order to realize data mining and fault diagnosis for other key components (e.g., the drive motor, reducer, brake system, hoisting wire rope, etc.), expanding the system's applicability beyond the tail ropes alone.
In our experimental environment, it took less than 10 ms to complete the prediction of one sample, and the prediction accuracy was 100%. The use of a graphics processing unit would reduce the time cost further. A larger dataset would improve the generalization performance of the network, making the prediction accuracy higher and more stable. Therefore, the CNN can be used in industrial applications.

Conclusions and Future Work
Aiming at the problems of high difficulty, high risk, and low recognition efficiency in the existing manual methods for fault detection in BTRs, a health monitoring method for the balancing tail ropes of a hoisting system based on a convolutional neural network is proposed in this paper. In this method, real-time tail rope images are first captured through CCD cameras, and data transmission is realized using a movable sensor network in the vertical shaft. Then, the preprocessed images are input to train the convolutional neural network in order to realize the automatic recognition of BTR faults. Finally, fault warnings are issued based on the identification results. The research can be summarized and concluded as follows:

(1) A CNN including two convolution layers, two pooling layers, and three fully connected layers is proposed. The structure of the CNN is denoted as Input(28 × 28)-64C(3 × 3)-64P(2 × 2)-128C(3 × 3)-128P(2 × 2)-FC(200-64-32)-Output(9), meaning that the dimensions of the input 2D data are 28 × 28; the CNN first applies one convolutional layer with 64 filters of size 3 × 3, followed by one maximum-pooling layer with pooling size 2 × 2. One convolutional layer with 128 filters (filter size 3 × 3) is applied next, after which one pooling layer with pooling size 2 × 2 is applied. Finally, three fully connected layers with 200, 64, and 32 hidden neurons, respectively, are applied. The size of the output layer is 9, equal to the number of fault types.
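The layer-by-layer sizes implied by this structure can be checked with a small shape-propagation sketch (assuming unpadded, stride-1 convolutions and non-overlapping 2 × 2 pooling; the paper's notation suggests this but does not state the padding explicitly):

```python
def cnn_output_shape(size):
    """Propagate a size x size input through the
    Input-64C(3x3)-64P(2x2)-128C(3x3)-128P(2x2) stage and return
    the final spatial size plus the flattened feature length."""
    size = size - 3 + 1   # 3x3 convolution, 64 filters: 28 -> 26
    size = size // 2      # 2x2 max-pooling:             26 -> 13
    size = size - 3 + 1   # 3x3 convolution, 128 filters: 13 -> 11
    size = size // 2      # 2x2 max-pooling:             11 -> 5
    return size, size * size * 128   # spatial size, flattened length

print(cnn_output_shape(28))   # (5, 3200)
```

Under these assumptions, the FC(200-64-32) stage therefore receives a 3200-dimensional flattened vector.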
(2) A method for the description and establishment of an image dataset that can cover the entire feature space of overhanging BTRs is proposed. The BTRs image dataset covering the 9 features in the state space is set up and further expanded to a scale of 4500 by scale and rotation to enhance the generalization ability of the network model. The same method can be used to describe data from hoisting systems containing more than two tail ropes.
(3) The CNN, KNN, and ANN-BP algorithms were used to train and test the established tail rope image dataset, and the effects of the hyper-parameters on the network diagnostic accuracy were investigated experimentally. The experimental results showed that the features of the BTR images were adaptively extracted by the CNN's convolutional and pooling operations, which means that a great deal of manpower can be saved and online updates can be realized, meeting real-time requirements. The learning rate and batch size strongly affected the accuracy and training efficiency, with the best values of the learning rate and batch size being 0.01 and 5, respectively. The L2 regularization of the fully connected layer F1 prevented overfitting. The fault diagnosis accuracy of the CNN was 100%, while that of KNN was 94.04% and that of ANN-BP was 96.44%, so the diagnosis accuracy of the CNN was much higher than that of the KNN and ANN-BP algorithms. Additionally, the CNN could accurately identify and predict all kinds of BTR states, while the ANN-BP and KNN had some deficiencies in distinguishing similar fault types. Therefore, the CNN achieved high accuracy, real-time functioning, and good generalization performance, making it more suitable for application in the health monitoring of hoisting system BTRs.

For industrial applications, future work will be to build the monitoring system's software and hardware architecture. Meanwhile, although the method proposed in this paper obtained a good performance, it also has a shortcoming: if two or more fault features appear in one feature map, the recognition result may be affected. Therefore, in order to solve the problem of multi-fault coupling, target detection on BTR feature maps based on R-CNN (regions with CNN features) will be the next research direction.