Multi-Modal Learning-Based Equipment Fault Prediction in the Internet of Things
Round 1
Reviewer 1 Report
The article is devoted to a practical application of machine learning methods. The structure of the article is classical. However, it is undesirable to present figures in the Conclusions section. The article is easy to read, and the level of English is acceptable. The quality of the figures varies. The article cites 30 sources, not all of which are relevant.
The following remarks can be made on the material of the article:
1. An important step in solving any ML problem, and an active learning problem in particular, is choosing a suitable base model, i.e., the one on whose basis active learning will be compared with passive learning. One of the key requirements for this model is the absence of overfitting. Active learning implies constant retraining of the model, and if the model overfits, then no matter how the new data are chosen, the accuracy will not increase significantly and may even decrease. Of course, one could instead train the model from scratch at each step of the active phase, stopping the process with early stopping, but this would make the experiments too long, since several dozen epochs would be required instead of a single epoch of retraining on the new data (a sketch of such a loop is given after these remarks). Based on this thesis, I ask the authors to justify the choice of the base model presented in Table 1.
2. Usually, late fusion of modalities is performed: image and text embeddings are first processed separately. This approach reduces the size of the neural network, which first extracts the necessary information from each modality and then combines it for the final prediction. In addition, three model heads (one for text, one for the image, and one combined) further push the network to learn weights that extract as much classification-relevant information as possible from each modality (a sketch of this pattern is given after these remarks). How is this idea reflected in the authors' model?
3. One of the natural and important questions that arises when building an architecture similar to the one studied by the authors is how to compute the loss function. The traditional answers are: simple summation of the loss terms from the different heads; a weighted loss with manually tuned weights; or a weighted loss with learned head weights (the three options are sketched after these remarks). How do the authors justify their choice?
4. There are two important parameters (an illustrative grid over them is given after these remarks). The first is the size of the initial dataset on which the model is trained during the passive phase. If it is too small, it will be difficult to compare the effect of active learning against retraining on random data: the accuracy will grow rapidly in both cases. If, on the contrary, the initial labeled set is too large, the model will already be well trained after the passive phase, and the gain in accuracy during the active phase will be weak regardless of the training method. The second is the size of the query to the expert. On one hand, objects can be sent to the expert one at a time; in that case, the first object in the query maximizes the criterion of the chosen active learning strategy (when objects are sorted in descending order of the criterion), and after training on that object the remaining candidates will most likely cease to be of interest. On the other hand, selecting objects one at a time prolongs the experiment and complicates the whole study. The number of steps in the active learning phase can also be varied. What values of these parameters did the authors choose, and why?
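To make remark 1 concrete: a minimal sketch, assuming a pool-based uncertainty-sampling setup, of a loop in which the base model is refit from scratch with early stopping at every round. The helper `query_most_uncertain`, the MLP base model, and all sizes are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def query_most_uncertain(model, X_pool, batch_size):
    """Least-confident uncertainty sampling: pick the pool points whose
    top-class probability is lowest."""
    confidence = model.predict_proba(X_pool).max(axis=1)
    return np.argsort(confidence)[:batch_size]

def active_learning_loop(X_lab, y_lab, X_pool, y_pool, rounds=10, batch_size=16):
    model = None
    for _ in range(rounds):
        # Refit from scratch each round; early stopping guards against
        # overfitting but costs tens of epochs instead of one epoch of
        # incremental retraining on the newly labeled data.
        model = MLPClassifier(hidden_layer_sizes=(64,),
                              early_stopping=True, random_state=0)
        model.fit(X_lab, y_lab)
        idx = query_most_uncertain(model, X_pool, batch_size)
        # The "expert" labels the queried points (simulated here by y_pool).
        X_lab = np.vstack([X_lab, X_pool[idx]])
        y_lab = np.concatenate([y_lab, y_pool[idx]])
        X_pool = np.delete(X_pool, idx, axis=0)
        y_pool = np.delete(y_pool, idx)
    return model
```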
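On remark 2, a minimal PyTorch sketch of the late-fusion pattern with three heads. The embedding dimensions, hidden size, and layer choices are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Late fusion: each modality is encoded separately, and three heads
    (text-only, image-only, combined) are trained jointly, so each encoder
    is pushed to extract classification-relevant features on its own."""
    def __init__(self, text_dim=300, img_dim=512, hidden=128, n_classes=2):
        super().__init__()
        self.text_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.img_enc = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.text_head = nn.Linear(hidden, n_classes)
        self.img_head = nn.Linear(hidden, n_classes)
        self.fused_head = nn.Linear(2 * hidden, n_classes)

    def forward(self, text_emb, img_emb):
        t = self.text_enc(text_emb)
        v = self.img_enc(img_emb)
        # Late fusion: the per-modality representations are concatenated
        # only at the end, for the combined prediction.
        return (self.text_head(t), self.img_head(v),
                self.fused_head(torch.cat([t, v], dim=-1)))
```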
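On remark 3, the three traditional options can be written out as follows; the learned variant uses homoscedastic uncertainty weighting (Kendall et al., 2018) as one concrete way to train head weights. The fixed weights shown are arbitrary placeholders.

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def summed_loss(logits_t, logits_v, logits_f, y):
    # Option 1: simple summation of the per-head loss terms.
    return ce(logits_t, y) + ce(logits_v, y) + ce(logits_f, y)

def fixed_weighted_loss(logits_t, logits_v, logits_f, y, w=(0.25, 0.25, 0.5)):
    # Option 2: manually tuned weights (found, e.g., by grid search).
    return w[0] * ce(logits_t, y) + w[1] * ce(logits_v, y) + w[2] * ce(logits_f, y)

class LearnedWeightedLoss(nn.Module):
    """Option 3: learned head weights via homoscedastic uncertainty
    weighting: L = sum_i exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2)
    is a trainable parameter per head."""
    def __init__(self, n_heads=3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_heads))

    def forward(self, per_head_losses):
        losses = torch.stack(per_head_losses)
        return (torch.exp(-self.log_vars) * losses + self.log_vars).sum()
```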
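On remark 4, an illustrative experiment grid over the two parameters, with the number of active rounds derived from a fixed labeling budget so that every setting sees the same total number of labels. All values are hypothetical, not the authors' settings.

```python
from itertools import product

INIT_SIZES = [50, 200, 1000]  # too small: both methods improve fast anyway;
                              # too large: little headroom left for the active phase
QUERY_SIZES = [1, 10, 50]     # 1 follows the selection criterion exactly but is slow;
                              # larger batches risk redundant queries within a batch
LABEL_BUDGET = 2000           # total labels available, passive + active

for init_size, query_size in product(INIT_SIZES, QUERY_SIZES):
    n_rounds = (LABEL_BUDGET - init_size) // query_size
    print(f"init={init_size:4d}  query={query_size:2d}  rounds={n_rounds}")
```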
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
This work studies a multi-modal learning algorithm for equipment fault prediction in the Internet of Things. In particular, the authors propose a multi-modal learning framework that fuses low-quality and high-quality monitoring data to predict IoT equipment faults.
The comments, which may help improve the quality of this submission, are summarized as follows.
- What are the definitions of low-quality and high-quality monitoring data in the context of this work? How can the authors differentiate between them? Please elaborate in the manuscript.
- A summary of the notation is essential to enhance the readability of the manuscript.
- Please add a list of acronyms used throughout the manuscript.
- Could the authors improve the experimental analysis in Section 4.2 by adding an algorithmic procedure for the “Multi-modal Learning Algorithm”?
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
I formulated the following remarks on the initial version of the article (quoted in full in the Round 1 report above):
1: justify the choice of the base model (Table 1) with respect to overfitting under constant retraining;
2: explain how late fusion and per-modality heads are reflected in the authors' model;
3: justify how the loss terms from the different heads are combined;
4: justify the chosen size of the initial labeled set, the size of the expert query, and the number of active learning steps.
The authors answered all of them. I liked the authors' restrained and informative answers. I recommend the revised version of the article for publication.