3.2. Experimental Results
In the context of the rice phenomics entity classification experiment, the model based on Roberta–two-layer BiLSTM-MHA demonstrates remarkable efficacy, as evidenced in
Figure 10. The model attains an accuracy of 89.56%, meaning that roughly nine in ten predicted outcomes are correctly classified. This substantiates the model’s high classification accuracy and its capacity to discern diverse entities within the domain of rice phenomics.
Figure 10.
Experimental results of Roberta–two-layer BiLSTM-MHA.
The recall of 86.4% reflects the model’s capacity to identify genuine positive instances: when detecting rice phenotype-related entities, it captures a substantial proportion of true positives, thereby reducing false negatives.
The F1-score, the harmonic mean of precision and recall, reaches 87.9%. By balancing the two measures, the F1-score offers a comprehensive evaluation of model performance and avoids the bias that can arise from emphasising a single indicator. The high F1-score further substantiates the validity and reliability of the model for classifying rice phenomics entities.
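For concreteness, the three metrics above can be computed as in the following sketch using scikit-learn; the label arrays are illustrative placeholders, not the study’s actual outputs.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Illustrative placeholder labels; not the study's actual predictions.
y_true = ["Gene", "Phenotype", "Other", "Gene", "Environment", "Chemical"]
y_pred = ["Gene", "Phenotype", "Other", "Phenotype", "Environment", "Chemical"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging weights all entity categories equally, regardless of frequency.
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={accuracy:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```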
- (1)
Confusion matrix analysis
The result is shown in the confusion matrix in
Figure 11. The diagonal elements of the confusion matrix take large values: for example, 539 samples whose true class is ‘Gene’ are correctly predicted as ‘Gene’, 386 samples are correctly predicted in the ‘Other’ class, and 288 in the ‘Phenotype’ class. This indicates that the model identifies rice phenomics entities in the ‘Gene’, ‘Other’, and ‘Phenotype’ categories with high accuracy and correctly distinguishes most samples belonging to these categories.
Figure 11.
Confusion matrix for Roberta-two-layer BiLSTM-MHA experiments.
Non-diagonal element analysis: in the ‘Chemical’ category, five samples were incorrectly predicted as ‘Environment’ and eight as ‘Phenotype’; in the ‘Environment’ category, 10 samples were incorrectly predicted as ‘Chemical’ and 9 as ‘Other’, and so on. These non-diagonal elements reveal confusable category pairs: for example, ‘Chemical’ and ‘Environment’, and ‘Environment’ and ‘Other’, may share similar features, leading the model to misclassify between them.
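A minimal sketch of how such a confusion matrix can be produced with scikit-learn follows; the label arrays are again invented placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

classes = ["Chemical", "Environment", "Gene", "Other", "Phenotype"]
# Illustrative placeholder labels; not the study's actual predictions.
y_true = ["Gene", "Other", "Phenotype", "Chemical", "Environment", "Gene"]
y_pred = ["Gene", "Other", "Phenotype", "Environment", "Chemical", "Gene"]

# Rows are true classes, columns are predicted classes; the diagonal counts
# correct predictions, while off-diagonal cells expose confusable class pairs.
cm = confusion_matrix(y_true, y_pred, labels=classes)
ConfusionMatrixDisplay(cm, display_labels=classes).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```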
- (2)
ROC curve analysis
From the ROC curve in
Figure 12, the curves of all classes lie close to the upper left corner. The curve for ‘Class Other’ almost coincides with the upper left corner, and its AUC reaches 1.0, indicating excellent prediction performance for this category: the model distinguishes ‘Other’ samples from all other categories very accurately. ‘Class Chemical’ and ‘Class Gene’ have AUC values of 0.99; ‘Class Environment’ and ‘Class Phenotype’ have AUC values of 0.96 and 0.98, respectively; and the macro-average AUC of the model is 0.98. This indicates good overall classification performance across classes, with the model maintaining a high true positive rate while keeping the false positive rate low.
Figure 12.
ROC curves for Roberta-two-layer BiLSTM-MHA experiments.
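The per-class curves are one-vs-rest ROC curves; a hedged sketch of how the per-class and macro-average AUC values can be computed is given below. The probability matrix is a random stand-in for the classifier’s softmax outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

classes = ["Chemical", "Environment", "Gene", "Other", "Phenotype"]
y_true = ["Chemical", "Environment", "Gene", "Other", "Phenotype", "Gene"]
y_bin = label_binarize(y_true, classes=classes)    # one-hot gold labels

rng = np.random.default_rng(0)
probs = rng.random((len(y_true), len(classes)))    # stand-in for softmax outputs
probs /= probs.sum(axis=1, keepdims=True)

for i, name in enumerate(classes):
    # One-vs-rest AUC: this class against all the others.
    print(f"Class {name}: AUC = {roc_auc_score(y_bin[:, i], probs[:, i]):.2f}")
print(f"macro-average AUC = {roc_auc_score(y_bin, probs, average='macro'):.2f}")
```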
- (3)
Model training time and loss analysis
The loss change curves as well as the time curves are shown in
Figure 13. In the rice phenomics entity classification experiment, the loss decreases rapidly from about 8.76 in the early epochs, stabilises around the 22nd epoch, and converges to approximately 0.01 in the later stage. This indicates that the model learns quickly and converges effectively, acquiring relevant entity features and achieving good entity classification ability. The training time increases with the number of epochs to nearly 3700 s.
Figure 13.
Plot of model training rounds versus time and loss.
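A sketch of the per-epoch bookkeeping behind such loss and time curves is shown below; the tiny stand-in model, data, and epoch count are assumptions for illustration only.

```python
import time

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Tiny stand-in model and data so the loop runs end to end; in the real
# experiment these would be the Roberta–two-layer BiLSTM-MHA model and the
# rice phenomics dataset.
model = nn.Linear(16, 5)
loader = DataLoader(TensorDataset(torch.randn(64, 16),
                                  torch.randint(0, 5, (64,))),
                    batch_size=8)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

history = {"loss": [], "elapsed_s": []}
start = time.time()
for epoch in range(50):                       # epoch count is illustrative
    epoch_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)         # forward pass and loss
        loss.backward()                       # back-propagation
        optimizer.step()                      # parameter update
        epoch_loss += loss.item()
    history["loss"].append(epoch_loss / len(loader))   # per-epoch mean loss
    history["elapsed_s"].append(time.time() - start)   # cumulative wall-clock time
```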
3.2.1. Comparison with Traditional Classification Models
In order to comprehensively evaluate the efficacy of the Roberta–two-layer BiLSTM-MHA model for classifying entities in rice phenomics, this study selected three representative models: logistic regression, a support vector machine (SVM), and Naive Bayes.
The selection of these models was driven by their status as highly representative of traditional machine learning algorithms. The experimental setup entailed the training and testing of all models on an identical rice phenotype dataset, a methodological choice that ensured the consistency of the experimental environment and the reliability and fairness of the experimental results. The findings from these experiments are presented in
Table 6.
In terms of accuracy, the support vector machine (SVM) performs best among the traditional classification models, yet its accuracy is only 72.3%; logistic regression and Naive Bayes are lower still. This is because most traditional classification models are built on simple statistical assumptions and linear classification boundaries. For instance, logistic regression essentially classifies samples using linear functions, making it difficult to capture the abundant nonlinear relationships in rice phenomics data. Phenotypic features of rice, such as plant height, leaf colour, and spike shape, are often interrelated and affected by a variety of environmental and genetic factors. Logistic regression cannot effectively model these complex feature interactions, resulting in frequent errors when determining the entity category.
Table 6.
Experimental results comparing traditional classification models.
| Model | Accuracy | Recall | F1-Score |
|---|---|---|---|
| Logistic regression | 68.5% | 63.2% | 65.7% |
| Support vector machine | 72.3% | 68.9% | 70.5% |
| Naive Bayes | 65.8% | 60.5% | 63.0% |
| Roberta–two-layer BiLSTM-MHA | 89.56% | 86.4% | 87.9% |
In terms of recall, the performance of the traditional models is also unsatisfactory. The recall of logistic regression is 63.2%; that of Naive Bayes is 60.5%, and that of the support vector machine, though relatively high, is only 68.9%. This shows that traditional models are prone to missing a large number of real samples when identifying rice phenotypic entities. Taking the recognition of rice phenotypic entities at specific growth stages as an example, because traditional models mine features too shallowly, entities with subtle feature differences, such as leaf-yellowing phenotypes caused by different degrees of pigment deficiency, are easily overlooked.
The F1-score, which combines precision and recall, highlights the strength of the Roberta–two-layer BiLSTM-MHA model. Its F1-score of 87.9% is much higher than those of the traditional classification models, demonstrating its leading comprehensive performance and its ability to classify rice phenomics entities more efficiently and accurately.
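A plausible sketch of the baseline setup in Table 6 is given below, assuming a TF-IDF text-classification formulation with scikit-learn; the example texts and labels are invented placeholders, and the study’s actual feature pipeline is not specified here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented placeholder corpus; the real experiments use the rice phenotype dataset.
texts = ["OsSPL14 increases grain yield", "leaf yellowing under nitrogen deficiency"]
labels = ["Gene", "Phenotype"]

baselines = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Support vector machine": LinearSVC(),
    "Naive Bayes": MultinomialNB(),
}
for name, clf in baselines.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)   # identical features for fairness
    pipe.fit(texts, labels)
    print(name, pipe.score(texts, labels))         # would be held-out data in practice
```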
3.2.2. Comparison with Other Deep Learning Models
In addition to traditional classification models, this study compares the Roberta–two-layer BiLSTM-MHA model with other common deep learning models, including a simple convolutional neural network (CNN), a recurrent neural network (RNN), and the BERT model based on the Transformer architecture. All models were trained and tested in the same experimental environment to ensure the scientific validity of the comparison. The performance comparison of the models is shown in
Table 7.
Table 7.
Comparison with other deep learning models.
| Model | Accuracy | Recall | F1-Score |
|---|---|---|---|
| CNN | 75.4% | 70.1% | 72.6% |
| RNN | 78.2% | 73.5% | 75.8% |
| BERT | 81.3% | 78.6% | 79.9% |
| Roberta–two-layer BiLSTM-MHA | 89.56% | 86.4% | 87.9% |
From the loss curves of the different models shown in
Figure 14, the loss of each model is high at the beginning and then decreases rapidly as training progresses. The loss of the model in this study decreases slowly in the early stage and more rapidly in the middle and late stages, while the CNN, RNN, and BERT converge slightly more slowly and settle at slightly higher final losses. This suggests that the model in this study has better fitting and generalisation ability for classifying rice phenomics entities.
Figure 14.
Comparison of curve diagrams of different models.
The experimental results presented in
Table 7 are analysed as follows: the CNN model has clear advantages when processing image data, yet in rice phenomics entity classification it achieves an accuracy of 75.4%, a recall of 70.1%, and an F1-score of 72.6%. This suboptimal performance is primarily because CNNs were designed to extract local spatial features of images, capturing edges, textures, and similar information by sliding a convolutional kernel over the input. Rice phenotypic data, however, are mostly textual and lack this spatial structure, so the CNN struggles to extract effective semantic features, which greatly limits its classification performance.
The RNN model is well suited to sequential data and can capture forward and backward dependencies. In this experiment, however, its accuracy was 78.2%, its recall 73.5%, and its F1-score 75.8%. While the RNN can model the sequence information of rice phenotypic data, its ability to learn complex rice phenotypic features, such as those involving multi-factor interactions, is limited. Furthermore, the RNN is prone to vanishing or exploding gradients as sequence length increases, which significantly impairs training.
The BERT model, an archetypal exemplar of the Transformer architecture, has been deployed extensively in natural language processing (NLP). In the present experiment, it achieved an accuracy of 81.3%, a recall of 78.6%, and an F1-score of 79.9%. By comparison, the Roberta-based two-layer BiLSTM-MHA model exhibits substantial improvements in accuracy, recall, and the F1-score. These gains are primarily attributable to the two-layer BiLSTM’s capacity to model sequences in both forward and backward directions, comprehensively exploiting bidirectional sequence features, while the multi-head attention mechanism attends to different parts of the input sequence in parallel and weights different feature dimensions, enabling more effective learning of complex patterns in rice phenotypic data.
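A minimal sketch of the overall architecture, as described above, might look as follows in PyTorch; the hidden size, checkpoint name, and mean-pooling head are assumptions, while the Roberta encoder, two BiLSTM layers, and eight attention heads follow the settings reported in this section.

```python
import torch
from torch import nn
from transformers import AutoModel

class RobertaBiLSTMMHA(nn.Module):
    """Roberta encoder -> two-layer BiLSTM -> multi-head attention -> classifier."""

    def __init__(self, num_classes=5, hidden=256, heads=8, dropout=0.4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("roberta-base")  # checkpoint assumed
        self.bilstm = nn.LSTM(
            input_size=self.encoder.config.hidden_size,
            hidden_size=hidden,
            num_layers=2,          # two layers, the optimal depth per Table 9
            bidirectional=True,    # forward and backward sequence modelling
            dropout=dropout,       # 0.4, the best value per the discussion below
            batch_first=True,
        )
        self.mha = nn.MultiheadAttention(
            embed_dim=2 * hidden,  # BiLSTM concatenates both directions
            num_heads=heads,       # eight heads, the best setting per Table 10
            batch_first=True,
        )
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        x = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.bilstm(x)                  # bidirectional sequence features
        x, _ = self.mha(x, x, x)               # parallel attention over the sequence
        return self.classifier(x.mean(dim=1))  # mean-pool then classify (assumed head)
```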
In summary, the Roberta–two-layer BiLSTM-MHA-based model demonstrates superior performance in the rice phenomics entity classification task, both in comparison with traditional classification models and other deep learning models.
3.2.3. Analysis of the Contribution of the Components of the Model
- (1)
Analysis of the effectiveness of the Roberta pre-trained model
In the domain of natural language processing, pre-training models have been identified as a pivotal component in enhancing the efficacy of subsequent downstream tasks. In the context of rice phenomics entity classification, this study aims to provide a comparative analysis of the performance of a Roberta pre-trained model, a BERT pre-trained model, and a model that has not undergone any pre-training. The experimental setup and the specific data are presented in
Table 8.
The experimental results indicate that the combination with the Roberta pre-trained model performs best on all indicators. Through large-scale unsupervised learning on substantial text data, Roberta has acquired extensive semantic and syntactic knowledge. In processing rice phenotype-related text, it understands the semantics deeply and captures subtle features accurately. For example, when judging the phenotype of a rare rice disease, Roberta can quickly identify the key feature words in the text based on its pre-trained knowledge, providing a high-quality basis for the subsequent BiLSTM and multi-head attention processing.
Table 8.
Experimental results of word embedding model effectiveness analysis.
| Model | Accuracy | Recall | F1-Score |
|---|---|---|---|
| Roberta–two-layer BiLSTM-MHA | 89.56% | 86.4% | 87.9% |
| BERT–two-layer BiLSTM-MHA | 81.3% | 78.6% | 79.9% |
| Two-layer BiLSTM-MHA | 72.8% | 70.1% | 70.5% |
As a classic pre-trained model, BERT performs well in natural language processing tasks. In the specific domain of rice phenotyping, however, its capacity to learn domain-specific knowledge is slightly weaker than that of Roberta. When dealing with the complex text associated with rice genes and phenotypes, BERT’s understanding of some terminology and domain-specific expressions is not deep enough, leading to misclassifications and performance inferior to the Roberta-based model.
The two-layer BiLSTM-MHA without a pre-trained model can rely only on the subsequent components to extract simple features from the raw text. Faced with the complex semantic information of rice phenotypes, and lacking a deep understanding of the semantics, it cannot effectively extract key features, so its classification performance declines significantly, with accuracy, recall, and F1-scores far below those of the pre-trained models.
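A sketch of how the three encoder variants in Table 8 could be instantiated is shown below; the Hugging Face checkpoint names are assumptions, as the study’s exact checkpoints are not stated here.

```python
from transformers import AutoConfig, AutoModel

def build_encoder(variant):
    if variant == "roberta":
        return AutoModel.from_pretrained("roberta-base")        # pre-trained weights
    if variant == "bert":
        return AutoModel.from_pretrained("bert-base-uncased")   # pre-trained weights
    # "none": same architecture, randomly initialised, i.e. no pre-training.
    return AutoModel.from_config(AutoConfig.from_pretrained("roberta-base"))

# Each encoder feeds the same two-layer BiLSTM-MHA stack sketched above.
encoders = {v: build_encoder(v) for v in ("roberta", "bert", "none")}
```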
- (2)
Two-layer BiLSTM Effectiveness Analysis
When processing sequence data, BiLSTM can model the sequence from both forward and reverse directions, fully exploiting bidirectional sequence features. In order to determine the optimal number of BiLSTM layers, we carried out comparative experiments with different numbers of layers, and the specific data are shown in
Table 9.
One-layer BiLSTM is able to learn certain contextual dependencies and has some extraction ability for simple rice phenotypic sequence features. However, in the face of complex rice phenotypic semantics, such as those involving phenotypic features under the influence of multiple environmental factors, its feature extraction capability is limited, and it cannot fully mine the key information in the sequences, resulting in relatively low performance.
Table 9.
Effect analysis of the number of BiLSTM layers.
| BiLSTM Layers | Accuracy | Recall | F1-Score |
|---|---|---|---|
| One layer | 80.2% | 77.5% | 78.5% |
| Two layers | 89.56% | 86.4% | 87.9% |
| Three layers | 85.6% | 83.2% | 83.7% |
On the basis of one layer, two-layer BiLSTM further deepens the mining of contextual information. It can learn more advanced semantic features, such as complex associations between different rice phenotypes. When analysing the relationship between the rice growth cycle and phenotypic changes, two-layer BiLSTM is able to better capture the feature changes on the time series and significantly improve the classification performance of the model.
The characteristics of the three-layer BiLSTM model can be seen from the curves in
Figure 15. Although the three-layer BiLSTM theoretically has the potential to learn more complex features, in the actual training process the overfitting problem is more prominent, owing to the deeper network and its combination with the other components (Roberta and the multi-head attention mechanism).
As shown in
Figure 15a, in the early training period, the training-set accuracy of the three-layer BiLSTM rises rapidly, while the test-set accuracy begins to decline after reaching a certain level. This indicates overfitting: the model captures noise in the training data and overlearns it, weakening its generalisation to new data such as the test set, so the test accuracy stalls or even declines and the overall performance of the model suffers. Looking at
Figure 15b, the initial loss of the three-layer BiLSTM model is high and decreases quickly in the first 15 epochs, but thereafter the loss rises and fluctuates markedly as overfitting sets in; unlike the one-layer and two-layer models, its final loss values show an upward, fluctuating trend. As for the training time, the three training-time curves clearly show that the curve of the three-layer BiLSTM lies at the top: as the number of epochs increases, it requires significantly more training time than the one-layer and two-layer models.
Figure 15.
Comparison of the effect of the number of layers of BiLSTM models on the training time, loss, and accuracy. (a) Variation in accuracy between the test and training sets of three-layer BiLSTM; (b) Variation in BiLSTM loss and consumption time for different number of layers.
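The ablation varies only the BiLSTM depth; a minimal sketch, with the other settings held at the assumed values from the architecture sketch above, follows.

```python
from torch import nn

def build_bilstm(num_layers, input_size=768, hidden=256, dropout=0.4):
    # Inter-layer dropout only applies when there are at least two layers.
    return nn.LSTM(input_size=input_size, hidden_size=hidden,
                   num_layers=num_layers, bidirectional=True,
                   dropout=dropout if num_layers > 1 else 0.0,
                   batch_first=True)

# The three depths compared in Table 9 / Figure 15; everything else is fixed.
variants = {n: build_bilstm(n) for n in (1, 2, 3)}
```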
- (3)
Impact of Multi-head Attention Mechanisms
The multi-head attention mechanism plays a key role in improving the model’s ability to learn complex data patterns. It achieves a more comprehensive and deeper understanding of the data by focusing on different parts of the input sequence in parallel and assigning weights to different feature dimensions. In order to explore the specific impact of the multi-head attention mechanism in depth, this study conducted comparative experiments with and without the multi-head attention mechanism as well as with different numbers of heads (2, 4, 8, and 16), as shown in
Table 10.
When the model lacks the multi-head attention mechanism, it can only rely on BiLSTM to extract features according to a fixed pattern. When facing the complex semantic and feature relationships in the rice phenotypic data, it is not possible to strengthen and filter the important features in a targeted way, resulting in the insufficient mining of key information by the model and a significant decline in classification performance.
Table 10.
Results of the analysis of the effectiveness of the multi-attention mechanism.
| Model | Accuracy | Recall | F1-Score |
|---|---|---|---|
| Roberta–two-layer BiLSTM | 82.1% | 80.3% | 80.3% |
| Roberta–two-layer BiLSTM-MHA (2 heads) | 84.3% | 82.5% | 82.5% |
| Roberta–two-layer BiLSTM-MHA (4 heads) | 86.7% | 85.1% | 85.1% |
| Roberta–two-layer BiLSTM-MHA (8 heads) | 89.56% | 86.4% | 87.9% |
| Roberta–two-layer BiLSTM-MHA (16 heads) | 85.3% | 84.6% | 84.7% |
When the multi-head attention mechanism is set to two heads, although the model can pay attention to some features in parallel, due to the limited number of heads, the dimension of attention to features is not broad enough. When dealing with rice phenotypic data, only some more obvious key features can be captured, and it is difficult to effectively identify those features hidden in complex semantics and data relationships. This makes the model show some limitations when facing the task of complex phenotype classification, and the performance indexes have obvious gaps compared to those of models with higher head counts.
As the number of heads increases to four, the model’s ability to learn features improves. It can attend to relevant information in the rice phenotypic data from multiple perspectives; for example, when analysing the association between rice pests and diseases and phenotypes, it captures more feature relationships at different levels, which improves the model’s accuracy, recall, and F1-score to a certain extent. The model achieves the best performance balance with the eight-head multi-head attention mechanism.
It can pay attention to the input sequences in parallel from a richer perspective; comprehensively capture multiple features such as morphology, colour, growth environment, etc., in the rice phenotypic data; and effectively learn the relationship between these complex features through weighted integration. This makes the model perform well in various performance indicators, finding a better balance between feature diversity and computational resource consumption and achieving high classification performance with relatively reasonable computational resource consumption.
When the multi-head attention mechanism is set to 16 heads, due to the excessive number of heads, the dimension of information that the model needs to process increases dramatically, resulting in a sharp increase in computational resource consumption. The excessive number of heads also makes the model susceptible to some noise information during the training process, which affects the generalisation ability of the model to a certain extent, leading to a decrease in the accuracy and F1-score compared to the eight-head scenario.
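A small sketch of the head-count variants follows; the embedding dimension must be divisible by the number of heads, so more heads means narrower per-head subspaces (the 512-dimensional width is an assumption carried over from the architecture sketch).

```python
from torch import nn

embed_dim = 512   # 2 * 256, the assumed BiLSTM output width from the sketch above
for heads in (2, 4, 8, 16):   # head counts compared in Table 10
    mha = nn.MultiheadAttention(embed_dim, num_heads=heads, batch_first=True)
    # Each head attends within an (embed_dim // heads)-dimensional subspace.
    print(f"{heads} heads -> {embed_dim // heads} dimensions per head")
```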
- (4)
Analysis of the effectiveness of different loss functions
This experiment explores the effect of different loss functions on the model. The cross-entropy loss, focal loss, and label smoothing loss functions were compared experimentally. The experimental results are shown in
Table 11. The variations in loss, elapsed time, and F1-score with epoch are shown in
Figure 16.
Table 11.
Evaluation metrics for different loss functions.
| Loss Function | Accuracy | Recall | F1-Score |
|---|---|---|---|
| Cross-entropy loss | 87.1% | 85.6% | 86.7% |
| Focal loss | 89.56% | 86.4% | 87.9% |
| Label smoothing loss | 86.7% | 85.2% | 86.2% |
Figure 16.
Comparison of loss, the F1-score, and the elapsed time for different loss functions. (a) Variation in loss and consumption time for different loss functions; (b) Variation in F1-score with different loss functions.
From
Figure 16a, it can be seen that the loss values of all three loss functions are high at the beginning of training and decline as the number of epochs increases. The focal loss and cross-entropy loss decline faster than the label smoothing loss; the focal loss stabilises after about 22 epochs and the cross-entropy loss after about 25, whereas the label smoothing loss decreases slowly in the early stage and stabilises at a slightly higher value near 0.5. In terms of training time, the label smoothing loss takes the longest and the cross-entropy loss the shortest. From the F1-score curves in
Figure 16b, the initial F1-score with the focal loss is higher, near 0.68, and rises as the number of epochs increases. According to the evaluation metrics in
Table 11, the accuracy, recall, and F1-score of the focal loss are 89.56%, 86.4%, and 87.9%, respectively, the best among the three loss functions in this study; the label smoothing loss is relatively weak, with an accuracy of 86.7%, a recall of 85.2%, and an F1-score of 86.2%. The reason is that the focal loss learns by focusing on difficult samples, which gives it a clear advantage on problems such as data imbalance; the label smoothing loss is time-consuming to train and its metrics are not outstanding, but smoothing the labels helps prevent overfitting.
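For reference, the three loss functions can be set up as in the following sketch; the focal loss follows the standard formulation, and the alpha, gamma, and smoothing values are assumptions, as the study does not report them here.

```python
import torch
import torch.nn.functional as F
from torch import nn

class FocalLoss(nn.Module):
    """Focal loss: down-weights easy examples so training focuses on hard ones."""

    def __init__(self, gamma=2.0, alpha=0.25):   # values assumed, not reported
        super().__init__()
        self.gamma, self.alpha = gamma, alpha

    def forward(self, logits, targets):
        ce = F.cross_entropy(logits, targets, reduction="none")  # equals -log p_t
        p_t = torch.exp(-ce)                                     # true-class probability
        return (self.alpha * (1.0 - p_t) ** self.gamma * ce).mean()

losses = {
    "cross-entropy": nn.CrossEntropyLoss(),
    "focal": FocalLoss(),
    "label smoothing": nn.CrossEntropyLoss(label_smoothing=0.1),  # factor assumed
}
```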
3.3. Discussion
In the Roberta–two-layer BiLSTM-MHA rice phenomics entity classification experiments, parameter settings have a key impact on model performance. This section explores in detail the model’s accuracy, recall, and F1-score for different values of the learning rate (lr), random deactivation rate (dropout), batch size, and optimiser in order to determine the optimal parameter combination and analyse its effectiveness.
- (1)
The influence of the learning rate (lr)
The learning rate, a pivotal hyperparameter in model training, directly determines the step size of parameter updates; in essence, it modulates the model’s pace along the gradient of the loss function during optimisation. During training, the model computes the gradient of the loss with respect to the parameters via back-propagation and then updates the parameters according to the learning rate, gradually reducing the loss and bringing predictions closer to the true values. In this study, a series of different learning rates were set up for the experiment, and the results are shown in
Figure 17.
When the learning rate is set too small, such as 0.00001 and 0.00005 in the experiments, the model changes the parameters by an extremely small amount during each parameter update. This means that the model moves very slowly in the space of the loss function and requires a large number of training iterations to gradually approach the optimal solution. In practice, this leads to a significant increase in training time and resource consumption. Concurrently, the gradual nature of parameter updates hinders the model’s capacity to fully grasp the intricate characteristics and patterns embedded within the data, consequently leading to suboptimal performance metrics such as accuracy, recall, and the F1-score. To illustrate this point, consider the rice phenomics entity classification task. In this scenario, the model might encounter challenges in accurately discerning the subtle phenotypic variations influenced by distinct growth stages and environmental factors in rice, resulting in an increased number of misclassifications.
Figure 17.
Impact of the learning rate on model performance.
On the contrary, when the learning rate is set too large, e.g., 0.006 or 0.01, the step size during parameter updates is too big. The model then skips over the optimal solution in the loss-function space, or even oscillates violently around it and fails to converge stably. In this case, training becomes unstable: the loss fails to decrease consistently and may even rise, so the model fails to learn effective features and all performance metrics drop significantly. For example, in the experiment the model may deviate greatly when classifying some samples, assigning samples that belong to one phenotype category to other categories, which seriously affects classification accuracy.
The learning rate of 0.0001, on the other hand, showed the best results in this experiment. This value allows the model to maintain a reasonable update speed without overshooting the optimal solution due to an excessive step size. The model converges to a good state in a relatively short time and fully learns the features and regularities in the rice phenotypic data. In practice, this means the model can accurately identify the various phenotypic entities of rice, whether common phenotypes or special phenotypes with subtle differences, providing reliable support for the study of rice phenomics.
- (2)
Impact of dropout
The random deactivation rate (dropout) is a widely used regularisation technique in deep learning. Its core principle is to randomly ‘drop out’ some of the neurons and their connections in the network with a certain probability during training. This breaks complex co-adaptations between neurons, so the model cannot rely excessively on specific neuron connection patterns, which improves generalisation and effectively prevents overfitting. The experimental results for different dropout values are plotted in
Figure 18.
From
Figure 18, we can see that when the dropout value is small, such as 0.2 in the experiment, the model discards fewer neurons during the training process, and most of the connections between neurons are preserved. This makes it easy for the model to learn some detailed features, or even noisy information, in the training data, thus exhibiting higher accuracy on the training set. However, this over-learning of the training data leads to the model being unable to accurately recognise new sample patterns when faced with the test set, and overfitting occurs, resulting in a decrease in accuracy, recall, and the F1-score on the test set. For example, in the rice phenomics entity classification task, the model may remember the special phenotypic features of some rice samples in the training set, but these features are not universal, and when it encounters rice samples with different growing environments or other disturbing factors in the test set, the model will make classification errors.
Figure 18.
Experimental results for different dropout values.
With the gradual increase in the dropout value, the number of neurons discarded by the model increases; the co-adaptation between neurons is effectively suppressed, and the overfitting phenomenon is alleviated. In the experiment, when the dropout value is 0.3, the performance of the model on the test set is improved, and all the indexes are improved. This indicates that a moderate increase in the dropout value helps the model learn more general features and enhances the generalisation ability of the model.
However, when the dropout values are too large, such as 0.6 and 0.7, the model discards too many neurons during the training process, resulting in a serious impact on the model’s learning ability. The model cannot fully learn the effective features in the data, and the underfitting phenomenon occurs. At this time, the accuracy, recall, and F1-score of the model on both the training and test sets are significantly reduced. For example, when classifying rice phenotypes, the model may not be able to accurately capture some key phenotypic features, incorrectly classify rice samples of different phenotypes into the same category, or fail to identify samples of some special phenotypes, resulting in a significant increase in the classification error rate.
In this experiment, the dropout value of 0.4 showed the best effect. This value strikes a good balance between preventing overfitting and preserving the model’s learning ability: it effectively inhibits co-adaptation between neurons, preventing the model from over-learning noisy information in the training data, while still leaving enough neuron connections to learn the effective features in the data.
- (3)
Influence of batch size
Batch size refers to the number of samples input into the model during each training, which plays a pivotal role in the model training process and has a significant impact on the training efficiency, stability, and final model performance. In this study, batch sizes of 8, 16, 64, 128, 256, and 512 were selected for a series of experiments, and the experimental results are shown in
Figure 19.
Figure 19.
Experimental results of the effect of different batch size values on the performance of the model.
As demonstrated in
Figure 19, when the batch size is set to a small value, such as 8 or 16, each parameter update is based on a limited number of samples. The gradient estimate therefore fluctuates strongly, and parameter updates become unstable. This can be likened to searching for the optimal solution within a confined space, where local deviations easily cause a loss of direction. Such instability hinders the model’s ability to capture global features in the data, impairing its generalisation. In the rice phenomics entity classification task, the model may fail to accurately identify phenotypic features with subtle differences across growth environments and developmental stages, leading to suboptimal accuracy, recall, and F1-score. Additionally, because few samples are processed per iteration, many iterations are needed, which considerably extends the training time and reduces training efficiency.
However, as the batch size increases, the situation improves. When the batch size is 64, each parameter update is based on more samples; the gradient is estimated more accurately, and the stability of the parameter update is significantly improved. The model is able to learn from more data features, which improves the performance of the model to a certain extent, and all the indexes have increased. Consequently, the model exhibits enhanced stability in its progression towards the optimal solution during the training process, and the learning of rice phenotypic features is characterised by greater comprehensiveness and depth.
However, when the batch size is set too large (for example, 256 or 512), although the stability of parameter updating is enhanced, the model becomes more likely to settle into a local optimum. With large batches, the optimisation process is more easily attracted to a local optimum, making the global optimum harder to find. For instance, the rice phenotypic classification model may become overly reliant on certain common phenotypic feature patterns, overlooking unique cases or subtle feature distinctions; its classification capability then declines, as do accuracy, recall, and the F1-score on diverse test samples. Moreover, overly large batches demand more memory to store and process the data, which can become a constraint where hardware resources are limited.
In this experiment, a batch size of 128 was determined to be an optimal balance between training efficiency, stability, and model performance. This ensured that there were sufficient samples to support each parameter update, thereby enhancing the accuracy of gradient estimation and ensuring the stability of parameter updates. Additionally, this approach prevented the problem of falling into a local optimal solution due to a large batch size. The model consumes reasonable computational resources, facilitating the comprehensive learning of various features in rice phenotyping data. This results in optimal performance metrics, including accuracy, recall, and the F1-score, thereby ensuring the reliable classification of rice phenomics entities.
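Taken together, the learning-rate, dropout, and batch-size values examined in this section amount to a hyperparameter search; whether the study swept these jointly or one at a time is not stated, so the full grid below is an assumption for illustration.

```python
from itertools import product

learning_rates = [1e-5, 5e-5, 1e-4, 6e-3, 1e-2]   # 1e-4 performed best (Figure 17)
dropouts = [0.2, 0.3, 0.4, 0.6, 0.7]              # 0.4 performed best (Figure 18)
batch_sizes = [8, 16, 64, 128, 256, 512]          # 128 performed best (Figure 19)

for lr, dropout, batch_size in product(learning_rates, dropouts, batch_sizes):
    config = {"lr": lr, "dropout": dropout, "batch_size": batch_size}
    # Each configuration would be passed to the training loop sketched earlier,
    # logging accuracy, recall, and F1 on the validation split.
    print(config)
```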
- (4)
Optimiser Impact
Optimisers carry the key responsibility of tuning model parameters to minimise the loss function during training, and the characteristics and strategies of different optimisers can significantly affect both the training process and the final performance of the model. In this study, SGD, Adam, and Adagrad were selected for the experiments, and the results are shown in
Figure 20.
Figure 20.
Results of the impact of different optimisers on the performance of the model.
Stochastic gradient descent (SGD) is a fundamental optimiser that computes the gradient from a small, randomly selected batch of samples in each iteration and updates the model parameters accordingly. Because each update considers only the gradient of the current mini-batch, the method is computationally simple and efficient, but the random sampling introduces significant noise into the gradient estimate, making parameter updates unstable and prone to oscillation. In the rice phenomics entity classification experiment, the loss of the model trained with SGD fluctuates and fails to decrease steadily, preventing convergence to a better solution; the experiments show an accuracy of 75.6%, a recall of 72.9%, and an F1-score of 74.2%. In practice, the model frequently misidentifies similar phenotypic entities, and this inaccurate classification significantly limits its application.
Adagrad (adaptive gradient algorithm) improves on SGD by adaptively adjusting the learning rate of each parameter based on the cumulative sum of its squared gradients over previous iterations: frequently updated parameters receive a smaller learning rate, and infrequently updated ones a larger one. This adaptive adjustment makes parameter updates more reasonable during training and improves stability to a certain extent. However, because Adagrad keeps accumulating squared gradients, the learning rate decays as training proceeds, which can leave it too small in later stages and slow convergence. In this experiment, Adagrad was relatively stable during training and did not oscillate violently as SGD did, but its convergence was notably slower; the resulting model achieved an accuracy of 82.5%, a recall of 79.8%, and an F1-score of 81.1%, all below the best optimiser. In practice, the model therefore needs more time to reach a satisfactory level, and its classification performance remains inferior.
The Adam (adaptive moment estimation) optimiser synthesises the merits of Adagrad and RMSProp in that it not only adaptively adjusts the learning rate for each parameter but also employs momentum to accelerate convergence. Adam considers both the first-order moments of the gradient (the mean) and the second-order moments of the gradient (the uncentered variance) when calculating the gradient and dynamically adjusts the learning rate through the estimation of these two moments. In the rice phenomics entity classification experiments, the Adam optimiser demonstrated excellent performance, exhibiting a rapid convergence rate and the capacity to swiftly reduce the loss function value at the commencement of the training period, thereby enabling the model to expeditiously approach the optimal solution. Concurrently, the Adam optimiser exhibited remarkable stability during the training process, with the loss function value diminishing in a smooth and continuous manner, thus circumventing the oscillation problem engendered by the unstable gradient. The model utilising the Adam optimiser attains 89.56% in accuracy, 86.4% in recall, and 87.9% in the F1-score, thereby demonstrating superiority over the SGD and Adagrad optimisers across all performance metrics. This facilitates the accurate identification of a range of phenotypic features during the classification of rice phenotypic entities, encompassing both prevalent phenotypes and distinctive ones with nuanced variations. The high-precision classification it enables provides substantial support for the study of rice phenomics.
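A sketch of how the three optimisers can be instantiated on the same parameters for such a comparison is given below; the stand-in module and learning rate are illustrative.

```python
import torch
from torch import nn

model = nn.Linear(512, 5)   # stand-in for the full Roberta–two-layer BiLSTM-MHA model
optimizers = {
    name: cls(model.parameters(), lr=1e-4)     # learning rate selected in (1) above
    for name, cls in [("SGD", torch.optim.SGD),
                      ("Adagrad", torch.optim.Adagrad),
                      ("Adam", torch.optim.Adam)]   # Adam performed best (Figure 20)
}
```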
- (5)
Exploration of the possibility of using the model for phenotyping other objects
From the perspective of the model architecture itself, the Roberta-based two-layer BiLSTM-MHA model has efficient text feature extraction and semantic understanding capabilities. In the rice phenotype classification task, it can effectively extract the semantic features of the rice phenomics text and realise efficient classification. This provides a certain foundation for applying it to the classification of phenotypic entities in other domains. For different types of phenotypes, as long as their phenotypic information can be represented in the form of textual data, the model is theoretically applicable. For example, in the field of botany, the phenotypic characteristics, such as morphology, physiology, and biochemistry, of crops other than rice, such as wheat, can also be recorded through textual descriptions. The model can process these texts and mine the phenotypic information embedded in them. In the field of zoology, animal behavioural characteristics, physiological indicators, and other phenotypic descriptive texts may also be analysed with the help of the model.
However, using the model for phenotype classification in other domains requires additional considerations. First, the phenotypic knowledge systems of different domains differ considerably, which demands high generalisation ability from the model. The current model’s parameters were trained on rice phenotypic data, and substantial pre-training and fine-tuning would be required for other domains. Second, data quality and volume also matter: phenotypic classification in other domains may lack sufficient high-quality labelled data, and missing or incomplete data can bias the model’s learning of semantic features.
3.5. Generalisation Experiment
In order to validate the generalisation ability of the rice phenomics entity classification model based on Roberta–two-layer BiLSTM-MHA, this study uses the publicly available dataset CLUENER2020 to carry out generalisation experiments.
- (1)
Introduction of the dataset
CLUENER2020 contains 10 different categories, namely organisation, name, address, company, government, book, game, movie, position, and scene. As demonstrated in
Table 13, this dataset boasts extensive domain coverage and diversity, providing an effective means to assess the generalisation performance of the model on diverse entity types.
- (2)
Experimental results and analysis
Following the implementation of multiple training and testing cycles, the model’s performance metrics for various categories have been obtained, as illustrated in
Figure 22. The evaluation of the model on the public dataset and on this study’s dataset is compared in
Table 12.
Table 12.
Comparison of model evaluation results between the self-developed dataset and the publicly available dataset CLUENER2020.
| Dataset | Accuracy | Recall | F1-Score |
|---|---|---|---|
| Dataset for this study | 89.56% | 86.4% | 87.9% |
| CLUENER2020 | 91.38% | 90.94% | 91.57% |
As illustrated in
Figure 23, for categories with a substantial number of samples, such as names, companies, and organisations, the model demonstrates commendable accuracy and a notably high F1-score, signifying an excellent capacity for recognising prevalent categories. Conversely, for categories with few samples, such as book titles and movies, performance declines slightly, which may be attributable to insufficient data leading to inadequate learning for these categories.
Then, from
Figure 24, it can be seen that the initial loss of the model on the public dataset is about 9.77; it decreases rapidly in the first 18 epochs and stabilises close to 0.01. This indicates that the model learns the data features quickly in the early training period, reaches a good convergence state later, and fits well.
The training time increases roughly linearly with the number of epochs, to nearly 4300 s, longer than on the rice phenomics entity classification dataset of this study because the public dataset is larger and differs in complexity.
Figure 23.
Breakdown of different entity categories in CLUENER2020.
Table 13.
Distribution of the training and test sets for the CLUENER2020 dataset.
| Dataset | Category | Quantity | Dataset | Category | Quantity |
|---|---|---|---|---|---|
| Training data | address | 2829 | Testing data | address | 364 |
| Training data | book | 1131 | Testing data | book | 152 |
| Training data | company | 2897 | Testing data | company | 366 |
| Training data | game | 2325 | Testing data | game | 287 |
| Training data | government | 1797 | Testing data | government | 244 |
| Training data | movie | 1109 | Testing data | movie | 150 |
| Training data | name | 3661 | Testing data | name | 451 |
| Training data | organisation | 3075 | Testing data | organisation | 344 |
| Training data | position | 3052 | Testing data | position | 425 |
| Training data | scene | 1462 | Testing data | scene | 199 |
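For reference, a hedged sketch of loading CLUENER2020 and tallying its categories follows; the dataset is distributed as JSON Lines with a “text” field and a “label” dictionary mapping each category to its entity spans, though the exact field layout should be verified against the downloaded files.

```python
import json
from collections import Counter

def load_cluener(path):
    """Read a CLUENER2020 split (JSON Lines) and tally entity categories."""
    samples, counts = [], Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)          # {"text": ..., "label": {cat: {...}}}
            samples.append(record)
            counts.update(record["label"].keys())
    return samples, counts

train, train_counts = load_cluener("cluener/train.json")  # path is an assumption
print(train_counts)   # should mirror the category distribution in Table 13
```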
Figure 24.
Plot of model training time and loss variation under the public dataset (CLUENER2020).