Identifying General Stress in Commercial Tomatoes Based on Machine Learning Applied to Plant Electrophysiology

Abstract: Automated monitoring of plant health is becoming a crucial component for optimizing agricultural production. Recently, several studies have shown that plant electrophysiology could be used as a tool to determine plant status related to applied stressors. However, to the best of our knowledge, there have been no studies relating electrical plant response to general stress responses as a proxy for plant health. This study models general stress of plants exposed to either biotic or abiotic stressors, namely drought, nutrient deficiencies or infestation with spider mites, using electrophysiological signals acquired from 36 plants. Moreover, in the signal processing procedure, the proposed workflow reuses information from the previous steps, thereby considerably reducing computation time compared with recent related approaches in the literature. Careful choice of the principal parameters leads to a classification of the general stress in plants with more than 80% accuracy. The main descriptive statistics, together with the Hjorth complexity, provide the most discriminative information for such classification. The presented findings open new paths to explore for improved monitoring of plant health.


Introduction
Automated monitoring of plant development is becoming a key enabler of optimized agricultural production [1,2]. The development of advanced technologies and digitalization has strong potential to revolutionize all sectors of activity, including agriculture and especially controlled-condition crop production such as greenhouse systems. Controlled growth condition systems are typically used to grow high-value crops such as vegetables, spices or ornamental plants. These systems are highly intensive and the most capital-demanding agricultural production systems; therefore, any reduction in their productivity has an important and immediate financial impact.
Many factors contribute to the quality and quantity of crop yield, including supply of the correct amount of irrigation and nutrients, as well as the prevention of losses from pests and diseases. For successful control, early detection and identification of problems is vital. Therefore, access to a tool providing an alert of a possible "danger" to plants' health would allow prompt interventions and improved regulation of growing conditions which would help reduce crop losses. In addition, early diagnosis would lead to reduced application of agrochemicals and an increased use of environmentally friendly practices, such as biological control agents. In other words, automated plant health monitoring could lead to significantly improved yields and efficient and environmentally sustainable crop protection.
Plants, as sessile organisms, continuously respond and acclimate to changes in environmental conditions in order to survive. They have developed different signaling systems that allow integration of environmental cues to coordinate molecular processes associated with both early development and adult plant physiology [3]. Among the different signaling pathways, electrical signals, studied under the name of electrophysiology, are a widely observed phenomenon and the most efficient for rapid transfer of information over long distances. It is well established that both plants and animals utilize long-range electrical signaling to transduce environmental information to the whole body [3,4]. The reaction of plants to either biotic or abiotic stress can be identified by electrical potential variations resulting from the changes in the underlying physiological process that the stressor triggers in a plant [5][6][7]. Hence, the electrophysiological signals naturally occurring within plants have strong potential for identifying the plant health status.
Several studies employing machine learning techniques have demonstrated that the electrophysiological plant response encodes signal patterns discriminating plant status related to the applied stressor [8][9][10][11][12]. Electrophysiological signals related to various stimuli, such as environmental stressors [8], pollutants [9,10], salt abundance [11], and fungal infection [12], have been analyzed in tomato, cucumber, soybeans, cabbage and wheat. Often these investigations were conducted strictly in controlled laboratory conditions using Faraday cages, limiting an understanding of their applicability in normal agricultural environments.
Recent work [13] has demonstrated the possibility of acquiring continuous and stable long-term plant electrophysiology signals in typical greenhouse conditions. Further studies have shown that the signal recordings encode discriminative patterns for stress related to drought [13] and to the presence of spider mites [14].
To the best of our knowledge, there have not yet been reported studies based on plant electrical responses, evaluating the stressed state of the plant in a broad sense without relating it to a specific stressor. An automated identification of a general stress in plants would automatically indicate to growers that there is something wrong with their crop long in advance of visual symptoms. It could therefore lead to more effective scouting, lower costs and significantly improved monitoring of crop health.
A recent exploratory study on plant electrophysiology introduces a methodology to classify the plant state related to the applied stimulus with an accuracy of 80% [14]. However, the related process of feature calculation, i.e., the extraction of information from the recorded signal to transform the raw signal data into a form to which a classifier can be applied, is a time-consuming task, which limits the potential use and benefits of this framework in daily agricultural practice. More precisely, the framework is designed so that, for each step of 5 min forward along the signal, it calculates 34 signal features in both the temporal and frequency domains within seven windows of different lengths varying from 15 s to 30 min, without retaining any information from the previous steps.
Given the limitations of the current literature, the goal of the presented study is twofold. Primarily, it introduces a novel approach for identifying a stressed state, in a general sense, in tomato plants growing in a typical production environment by combining information from their electrophysiological response to different stimuli, namely drought, nutrient deficiency and infestation with spider mites. Secondly, it aims to identify the most discriminative features allowing the identification of the source of plant stress. Furthermore, by reusing information from previous steps in the signal processing procedure, the proposed approach aims to decrease the required computation time compared with recent approaches in the related state of the art.

Experiments
The experiments were conducted in Agroscope's field station greenhouses in Conthey (Switzerland) on commercial tomato plants (var. Admiro) grown using soilless methods in coconut fiber substrate. Each of the experiments assessed the effect of a different common stressor on the plant's electrophysiological response. More precisely, on different sets of tomato plants, three types of stressors were analyzed:
• Drought. Normal irrigation was cut for three days. The drought period was considered to start nine hours after irrigation removal.
• Nutrient deficit. Four different nutrients were considered: manganese (Mn²⁺), iron (Fe²⁺), nitrogen (N), and calcium (Ca²⁺). Deficits of nitrogen and calcium, both of which are macronutrients, often lead to blossom-end rot [15] of the tomato fruit, which consequently reduces the yield. Manganese and iron, as micronutrients, are crucial for the photosynthetic machinery; therefore, their deficit triggers a major nutritional disorder that reduces growth and causes marked losses in both yield and quality [16,17]. For different sets of tomato plants, each of these nutrients was specifically removed from the full nutrient irrigation, and each experiment was continued for several days after the visual symptoms of the related deficit appeared.
• Infestation with spider mites. This represents one of the major pest problems of tomato cultivated in greenhouses, since the highly controlled environment of greenhouses favors low-humidity diseases. Plants were infested with spider mites (T. urticae) for two weeks as previously described [14], leading to significant infestations.
The visual symptoms in the stressed state of the plants after application of each type of stressor are shown in Figure 1. For instance, the drought is manifested by a strong wilt in the plants (Figure 1c), whereas the nutrient deficiency causes the appearance of yellow leaves (Figure 1d). A leaf highly infested with spider mites is shown in Figure 1e.

Recordings of the electrophysiological signal of each plant were made with multichannel PhytlSigns devices, which use pairs of electrodes: an active one placed in a higher part of the plant stem and a reference (ground) electrode in the lower part [13,14]. The placement of the active electrode and the device during acquisition is shown in Figure 1a,b, respectively. The input impedance of the device is on the order of many MΩ and the sampling rate was 500 Hz.

Dataset
In total, the dataset comprised 36 plants with an equal representation of each stressor. More precisely, 12 plants were infested with spider mites, another 12 were subjected to drought and the last 12 to nutrient deficiencies, with an equal distribution (i.e., three plants) for each of the studied nutrients. The recording durations were the same for each plant, with 24 h of normal state (before the application of the stressor) and 24 h during the stressed period, mainly after the visual symptoms triggered by the applied stimulus were evident. Twenty-four hours represents a full circadian cycle and is therefore long enough to represent the studied plant states for each plant. The recorded signal for each type of stressor is shown in Figure 2.
Appl. Sci. 2021, 11, x FOR PEER REVIEW


Preprocessing
Several preprocessing steps were applied to the raw electrophysiological recording.
Notch filtering. To eliminate possible noise generated by the electrical power source and its harmonics, two band-stop filters at 50 Hz and 100 Hz were applied to the signal.
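As an illustration of the notch-filtering step, the following minimal sketch cascades two second-order (biquad) notch filters at 50 Hz and 100 Hz for a 500 Hz sampling rate. The filter design (a standard biquad notch), the quality factor q = 30 and all function names are assumptions made for this example; the text does not specify how the band-stop filters were designed.

```python
import math


def notch_coefficients(f0, fs, q=30.0):
    """Coefficients of a second-order notch centred on f0 (Hz) at sampling rate fs (Hz)."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)  # q is an assumed quality factor
    a0 = 1.0 + alpha
    b = [1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a


def apply_biquad(x, b, a):
    """Filter the sequence x with the difference equation defined by b and a."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def remove_mains(x, fs=500.0, mains=(50.0, 100.0), q=30.0):
    """Cascade one notch per mains frequency (fundamental and first harmonic)."""
    for f0 in mains:
        b, a = notch_coefficients(f0, fs, q)
        x = apply_biquad(x, b, a)
    return x
```

After a short start-up transient, a pure 50 Hz component is suppressed while slow variations of the signal pass essentially unchanged.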
Windowing and feature extraction. To extract local signal features, the recordings were divided into relatively small windows. Following a recently proposed approach [14], in each of the isolated windows, 34 features characterizing the signal information in either the time or the frequency domain were calculated using Matlab (version R2020a, The MathWorks Inc., Natick, Massachusetts, USA). These features can be separated into six different groups:
• Group 1: minimum, maximum, variance, skewness, kurtosis, and interquartile range (IQR) of the windowed signal
• Group 2: Hjorth mobility, Hjorth complexity, Generalized Hurst exponent (GHE), Shannon wavelet packet entropy (wentropy), logarithmic wentropy and the root mean square (RMS) of the windowed signal
• Group 3: impulse, margin, shape and crest factor
• Group 4: frequency center, frequency RMS and the root variance of the frequency
• Group 5: similarity between the signal noise and the color noises: white, blue, brown, pink, and purple, respectively
• Group 6: minimum, maximum and mean value of the wavelet decomposition (WD) of order 8 at levels 1, 4 and 8, respectively
The initial step of the windowing approach involves choosing a signal interval (or segment) ending at time point t0. This segment is further divided into N windows of the same length l (Figure 3). In other words, the length of the segment is N·l. After calculating the chosen features in each window, all of the 34·N features extracted within the initial segment represent the first sample of the data to model, i.e., the sample at time point t0 (Figure 3A).
In the second step, the initial segment is shifted forward by a length l (Figure 3B). As in the previous step, all of the 34·N features extracted within this shifted segment represent the second sample of the data for modeling, i.e., the sample at time point t1. Note that the features for the first N-1 windows were already calculated in the previous step; consequently, at this step, only the features in the last window need to be computed.
Figure 3 (B). The segment chosen initially is shifted by a length l, also shifting its ending to the time point t1. As the first N-1 features at this step represent the last N-1 features from the previous step (Features_i: t0, i = 2, ..., N), they have already been calculated. Hence, the only remaining features to compute at this step are those within the last, Nth, window (Features_N: t1). All of the 34·N features extracted at this step now define the sample of the data for modeling at time point t1.
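The reuse of previously computed windows described above can be sketched as follows. This is a minimal illustration, not the authors' Matlab implementation: the function names are hypothetical, and `window_features` is a stand-in that returns only three values instead of the 34 features listed in the text.

```python
from collections import deque


def window_features(window):
    """Stand-in for the 34 features of the text: here only min, max and mean."""
    return [min(window), max(window), sum(window) / len(window)]


def sliding_samples(signal, window_len, n_windows, feature_fn=window_features):
    """Yield one feature vector per shift, computing each window's features once."""
    cache = deque(maxlen=n_windows)  # features of the N most recent windows
    n_full = len(signal) // window_len
    for k in range(n_full):
        window = signal[k * window_len:(k + 1) * window_len]
        cache.append(feature_fn(window))  # only the newest window is computed
        if len(cache) == n_windows:
            # Concatenate the cached features from the oldest to the newest window.
            yield [value for feats in cache for value in feats]
```

With this layout, each shift costs one call to the feature function regardless of N, because the bounded deque drops the oldest window automatically when the newest one is appended.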
Once the second sample is defined, the segment is shifted again by the same length l and the next sample is determined in the same manner as in the previous steps. This procedure is repeated until the end of the chosen recordings, resulting in a matrix of r rows representing the data samples and c columns corresponding to the entirety of the extracted feature vectors. More precisely,
r = T/l − N + 1, c = 34·N, (1)
where T denotes the duration of the processed recording.
The proposed approach allows the processing of signal information over relatively long durations, for instance 30 min, while avoiding an increase in the time required to compute the features within windows of such lengths. Moreover, this process also permits each sample to include information from the past. To compare the computing time required for feature extraction between the previous and the present approach, a plant electrophysiological recording of 5 days sampled at 500 Hz was processed by both methodologies. To make the new methodology comparable to the previous one, N = 7 windows of length l = 5 min were used.
The number of windows N and their length l are the two main parameters varied in the proposed method in order to optimize the outcome of the classification model. The tested values for both parameters were chosen to provide an approach that would be computationally efficient and, at the same time, would not require long recording durations. Hence, relatively small window lengths l were chosen, namely 30 s, 1 min, 2 min and 5 min, each of which was tested for three N values: 5, 10, and 15. Such a design allows the approach to be easily deployed in commercial agricultural practice.
Additionally, since the number of samples varies for different window lengths when the length of recordings is fixed, to perform a proper comparison between the window lengths, the same number of samples corresponding to r (Equation (1)) for 15 windows with length of 5 min was chosen for each case i.e., 274 samples for 24 h of recordings. For the window lengths shorter than 5 min, the 274 samples were randomly selected among all of the samples corresponding to a recording of 24 h.
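The sample count quoted above can be checked with a short calculation. The sketch below assumes that one sample is produced per shift of one window length once N windows are available; the function names are illustrative only.

```python
def n_samples(duration_s, window_s, n_windows):
    """Number of segments of n_windows consecutive windows, shifted by one window."""
    n_full_windows = duration_s // window_s
    return max(n_full_windows - n_windows + 1, 0)


def n_columns(n_windows, n_features=34):
    """Number of feature columns when each window contributes n_features values."""
    return n_features * n_windows


DAY_S = 24 * 60 * 60
print(n_samples(DAY_S, 5 * 60, 15))  # 274 samples for 24 h, 15 windows of 5 min
print(n_samples(DAY_S, 60, 30))      # 1411 samples for 24 h, 30 windows of 1 min
print(n_columns(15))                 # 510 feature columns
```

Under the same assumption, 30 windows of 1 min give 1411 samples per 24 h state, i.e., 2822 samples per plant over the normal and stressed periods, which matches the figure reported in the Results.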
Once the optimal window length was chosen, additional tests were performed to investigate the effect of the number of windows N while increasing it from 1 to 30. To equalize the number of samples per plant, for each of these tests the number of rows r of the data matrix corresponded to N = 30.
Labelling. The extracted samples of the signal represent either the normal, pre-stimulus, state or the stressed state of the plant. Accordingly, the samples were labelled as '0' when the corresponding segment finished in the normal state, whereas label '1' was assigned to the samples in the stressed state. The distribution between the labels was equal in each performed test.
Normalization. To compensate for possible variation in the electrical response between plants, a min-max normalization was applied to each feature vector of each plant before concatenating the corresponding matrices. This normalization brought the feature values to the interval [0, 1].
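A minimal sketch of this per-plant normalization is given below. The function names are hypothetical, and mapping a constant feature column to zeros is an assumption made here to avoid a division by zero; the text does not say how such a case was handled.

```python
def min_max_normalize(column):
    """Scale one feature column to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information and is set to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def normalize_plant(matrix):
    """matrix: list of rows (samples) for ONE plant; returns a normalized copy."""
    columns = [min_max_normalize(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]


def build_dataset(per_plant_matrices):
    """Normalize each plant separately, then concatenate the rows."""
    dataset = []
    for matrix in per_plant_matrices:
        dataset.extend(normalize_plant(matrix))
    return dataset
```

Normalizing before concatenation means that the minimum and maximum are taken per plant, so a plant with a generally larger signal amplitude does not dominate the shared feature range.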

Classification
Previous preliminary analyses revealed that the Extreme Gradient Boosting (XGB) [18] algorithm performs better for the classification of plant electrophysiological data than the other tested algorithms, namely Logistic Regression, Decision Trees, Random Forest and Deep Learning [13,14]. Hence, the classification models in this study were built using XGB, a gradient boosting technique optimized for speed and efficient use of computing resources. The modelling task was implemented with the XGBoost library in Python.
The model was built on the training set and evaluated on a separate test set. The splitting of the data into training and test sets followed two constraints. On the one hand, given the relatively high number of samples in the explored dataset, common practice suggests that the training set should contain around 80% of the dataset. On the other hand, we aimed to include in both sets an equal distribution of each stressor, and to keep all samples obtained from an individual plant in the same set, with the objective of minimizing possible bias in the evaluation. Therefore, data from 30 plants were incorporated in the training set, and the remaining six in the test set. Of these six, two represented drought, two the infestation with spider mites, and the last two the deficiency in Mn²⁺ and Ca²⁺, respectively. For each set, the plants representing a specific stressor were randomly chosen.
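The plant-level split can be illustrated with the sketch below. The plant identifiers, the fixed random seed and the function name are hypothetical; only the group sizes and the number of test plants per stressor are taken from the text.

```python
import random

# Hypothetical plant identifiers grouped by stressor (12 + 12 + 4 x 3 = 36 plants).
PLANTS = {
    "drought": [f"D{i}" for i in range(12)],
    "spider_mites": [f"S{i}" for i in range(12)],
    "Mn": [f"Mn{i}" for i in range(3)],
    "Fe": [f"Fe{i}" for i in range(3)],
    "N": [f"N{i}" for i in range(3)],
    "Ca": [f"Ca{i}" for i in range(3)],
}
# Test plants per stressor as described in the text: 2 + 2 + 1 (Mn) + 1 (Ca).
N_TEST = {"drought": 2, "spider_mites": 2, "Mn": 1, "Ca": 1}


def split_by_plant(plants, n_test, seed=0):
    """Return (train_ids, test_ids); all samples of a plant stay on one side."""
    rng = random.Random(seed)
    train, test = [], []
    for stressor, ids in plants.items():
        held_out = set(rng.sample(ids, n_test.get(stressor, 0)))
        test.extend(p for p in ids if p in held_out)
        train.extend(p for p in ids if p not in held_out)
    return train, test


train_plants, test_plants = split_by_plant(PLANTS, N_TEST)
```

Splitting by plant identifier rather than by sample prevents consecutive, highly correlated samples from the same plant from appearing in both sets.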
The learning procedure included tuning of several XGB parameters:
• max_depth, representing the maximum depth of a tree; it took a value from the set: [
Initially, the optimal number of trees was determined with a learning rate set to 0.1 whilst keeping the default values of the other parameters. In a second step, the other parameters were tuned by applying a grid search that used a custom cross validation [14] to avoid possible bias in the evaluation of the model performance. This cross validation used 10 folds, each of which comprised the samples of three different plants. Finally, the learning rate was reduced by a factor of 10 and the final number of trees was determined using the already tuned parameters.
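A plant-wise cross validation with 10 folds of three plants each can be sketched as follows. This is an assumption-laden illustration rather than the custom procedure of [14]: the random assignment of plants to folds, the seed and the function names are all hypothetical.

```python
import random


def plant_folds(plant_ids, n_folds=10, seed=0):
    """Randomly partition plant identifiers into n_folds groups of equal size."""
    ids = list(plant_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def cv_splits(sample_plants, folds):
    """Yield (train_idx, val_idx) so that a plant never appears on both sides.

    sample_plants[i] is the plant identifier of sample i.
    """
    for fold in folds:
        held_out = set(fold)
        val = [i for i, p in enumerate(sample_plants) if p in held_out]
        train = [i for i, p in enumerate(sample_plants) if p not in held_out]
        yield train, val
```

A list of such index pairs can be supplied as the cv argument of scikit-learn's GridSearchCV, which accepts an iterable of (train, test) index splits.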

Important Features for the Discrimination
One of the characteristics of the XGB algorithm is the ability to measure the relative importance of specific features for discriminating between classes. For instance, it is possible to evaluate the average gain across splitting brought by every calculated feature. Hence, a comparison between the most discriminative features for each tested scenario, in which either different window duration or different number of windows were used, could be realized.
Moreover, to assess the importance of each group of features, additional tests using each group separately or in combination with the others were also performed. In total, there were 62 combinations whose results were also compared with the case when taking all the six groups. This analysis was realized for the resulting optimal window length and N = 15, in order to include enough features in each case.
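The count of 62 combinations corresponds to all non-empty subsets of the six groups except the full set, as the short enumeration below confirms (the group labels are simply the numbers 1 to 6).

```python
from itertools import combinations

groups = [1, 2, 3, 4, 5, 6]
# Subsets of size 1 to 5; the full set of six groups is the separate reference case.
partial = [c for k in range(1, len(groups)) for c in combinations(groups, k)]
print(len(partial))  # 62
```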

Results
The novel methodology for feature extraction uses substantially less computing time when compared to the previously proposed approach [14]. More precisely, the previous approach took around 4 h and 30 min to process the signal of 5 days, while the new one did the computation within 30 min which is nine times faster. The results below all use the novel methodology.
The highest classification accuracy was obtained for models built upon features extracted from windows with a length of 1 min (Table 1). For 15 windows, the related accuracy reached 80.20%. As the length of the window increases, the accuracy is slightly reduced. Nevertheless, a more important drop in performance is observed for shorter windows, i.e., l = 30 s. The precision, recall and specificity are, again, highest for windows of 1 min. Moreover, only in that case are the recall and specificity relatively close in value, which indicates that the models built on features from 1-min windows are able to identify the normal and the stressed state equally well. For all other tested window lengths, there is a relatively large difference between the recall and the specificity. For instance, the models related to windows of 5 min have a higher specificity than recall and are therefore more likely to correctly detect the normal state than the stressed state. The opposite can be observed for the windows of 2 min, where the recall values are higher than the specificity of the models. In the same context, although with a poorer performance, the models built on windows of 30 s show similar behavior to those related to 2 min.
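For reference, the four measures discussed here can be computed from the confusion counts as in the minimal sketch below, assuming the stressed state (label 1) is the positive class and the normal state (label 0) the negative class. The function name and the example labels are illustrative and are not data from the study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and specificity for labels 0 (normal) and 1 (stressed)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),      # predicted stressed that are stressed
        "recall": ratio(tp, tp + fn),         # stressed samples correctly detected
        "specificity": ratio(tn, tn + fp),    # normal samples correctly detected
    }


# Made-up labels, for illustration only.
example = binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
```

A recall well above the specificity therefore means that the model detects the stressed state more reliably than the normal state, and vice versa.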
In summary, these findings (shown in more detail in Table 1) suggest that among the analyzed window lengths, 1 min is the optimal one. Hence, a length of 1 min is used for the further analyses. Table 2 shows, for each window length, the most discriminative features among the first 10 ranked by the XGB gain measure. Note that this list of the 10 most discriminative features may include the same feature several times, calculated within different windows of the signal segment, i.e., in different columns of the matrix resulting from the feature extraction procedure.
In general, the Hjorth complexity is the feature that is highly ranked for all of the tested scenarios. Additionally, it appears as the most important for the discrimination performed by the models built on features from windows with lengths of 1 to 5 min. Subsequent to the Hjorth complexity, shape and crest factors also appear as dominant features for the classification.
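For readers unfamiliar with the Hjorth parameters, the sketch below uses their common textbook definitions, with derivatives approximated by first differences of the samples. The study computed its features in Matlab, so this Python version and its function names are illustrative only; the mobility is expressed per sample rather than per second.

```python
import math


def _variance(x):
    """Population variance of a sequence."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)


def _diff(x):
    """First difference, used as a discrete approximation of the derivative."""
    return [b - a for a, b in zip(x, x[1:])]


def hjorth(x):
    """Return (activity, mobility, complexity) of a non-constant window x."""
    dx = _diff(x)
    ddx = _diff(dx)
    activity = _variance(x)
    mobility = math.sqrt(_variance(dx) / activity)
    # Complexity compares the mobility of the derivative with that of the signal.
    complexity = math.sqrt(_variance(ddx) / _variance(dx)) / mobility
    return activity, mobility, complexity
```

With these definitions, a pure sinusoid has a complexity close to 1, and the value grows as the waveform departs from a single sinusoidal shape.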
The effect of the number of windows on the classification performance was tested for all the samples calculated within 24 h for l = 1 min and N = 30. More precisely, each plant was represented by 2822 samples, of which one half corresponded to the normal state and the other half to the stressed state. The obtained results revealed that, in general, the model performance tends to improve as the number of windows increases. As shown in Figure 4, all four measures (accuracy, precision, recall, and specificity, calculated with the stressed state taken as the positive class) mainly follow an upward trend. However, the recall distribution is more dispersed than the other three distributions, indicating that for different N, i.e., different numbers of features, there is a relatively high difference in correctly predicting the stressed state.
Considering the most discriminative features, Hjorth complexity, crest factor and the similarity with brown noise appeared as the most dominant ones according to the ranking provided by the XGB gain measure.
The modelling performed for different combinations of feature groups showed that, when taking three or more groups of features, the highest accuracies are mainly observed for the combinations involving either Group 2, Group 5 or both, whereas for a smaller number of groups, the highest accuracy is in general found for the combinations including Group 1. Table 3 provides additional details about the performance of the models with the best accuracy for different numbers of feature groups, whereas Table S1 in the Supplementary Materials provides the accuracy of all the tested combinations together with the corresponding most discriminative features used by the XGB algorithm.
When analyzing each group individually, the highest accuracy, 82.56%, was observed for Group 1, which is considerably greater than for the other cases. For instance, the next in order was the model built on features from Group 5, whose accuracy reached only 67.34%. Additionally, for the models with the highest accuracy for different numbers of groups, there is a considerable difference between the recall and the specificity when Group 1 is not involved in the combination. For instance, the model built with groups 2, 3, 4, 5, and 6 achieves the highest accuracy among all of the other models; however, it is considerably more efficient at predicting the stressed state than the normal state.
In the case involving all of the features' groups, the accuracy (78.78%) was slightly lower than the best results obtained with a smaller number of groups.
In terms of the XGB gain-related ranking of discrimination power, the Hjorth complexity and crest factor appear overall as the predominant features. Additionally, when Group 2 is not included, the variance is in general among the most discriminative features.

Discussion
The present study proposes a novel approach for identifying, with an accuracy of above 80%, a stressed state in a tomato plant growing in typical production environment through the use of local signal information from the plant's electrophysiological response. More precisely, by combining electrophysiological data related to different abiotic and biotic stimuli, the proposed method offers a classifier that detects the stress commonly encountered in soilless tomato cultivation, which is referred to, in this study, as general stress. The studied stimuli are drought, nutrient deficit and infestation with spider mites.
Compared to a recent state-of-the-art methodology for classifying the plant electrical response [14], another advantage of the proposed workflow is the decrease in the computing time required to extract the signal information used for the classification. Although both approaches rely on the same signal features, the newly proposed one, as opposed to the previous approach, stores and reuses the calculations from the preceding steps during the extraction process. Hence, when both are applied in similar settings, the new approach performs the feature extraction approximately nine times faster, which strengthens its potential to be used in everyday practice. The lack of reported observations regarding the computation time required by the other related approaches in the current literature prevented this comparison from being extended to the rest of the relevant state of the art.
By keeping the features from the previous steps in a given sample, the new methodology allows the deliberate inclusion of historical data in the discrimination of signal patterns related to each state. The extent of the history depends on the chosen number of windows, which is also directly related to the dimension of the feature space used for building the model.
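The idea of storing per-window features and reusing them across samples can be illustrated with the following minimal sketch. It is not the authors' implementation: it assumes non-overlapping windows, uses only four descriptive statistics as a stand-in for the full feature set, and the names `window_features` and `build_samples` are hypothetical.

```python
from statistics import mean, pvariance


def window_features(window):
    """Illustrative per-window features (a small subset, chosen for brevity)."""
    return (min(window), max(window), mean(window), pvariance(window))


def build_samples(signal, window_len, n_windows):
    """Build one sample from the features of n_windows consecutive windows.

    Each window's features are computed once, stored in `cache`, and reused
    by every sample whose history includes that window.
    """
    n_total = len(signal) // window_len
    cache = [
        window_features(signal[i * window_len:(i + 1) * window_len])
        for i in range(n_total)
    ]
    samples = []
    for end in range(n_windows, n_total + 1):
        row = []
        for feats in cache[end - n_windows:end]:
            row.extend(feats)  # history is concatenated, oldest window first
        samples.append(row)
    return samples


# Synthetic signal for illustration only.
signal = list(range(12))
samples = build_samples(signal, window_len=3, n_windows=2)
```

Under these assumptions, each sample has `n_windows` times the number of per-window features as its dimension, which is why the chosen number of windows directly sets the size of the feature space.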
The approach for extracting the signal features allows the selection not only of the extent of the feature space but also of the length of the signal portion from which the information is extracted. The findings presented here reveal that signal intervals of 1 min contain the most pertinent information for discriminating general stress in tomato plants caused by common stressors in soilless cultivation. When this interval is halved, the discrimination power is considerably diminished, whereas for extended lengths of several minutes, the discriminative information is still present, if somewhat reduced. This observation is in line with the findings reported for the identification of stress in tomato plants related predominantly to the presence of spider mites, which revealed that windows of a few minutes encode information for identifying the state of the plant [14].
Discrimination between the normal and the stressed state tends to improve with an increase in the number of windows used. Nevertheless, introducing more features increases the model's complexity and, in consequence, its tendency to overfit. This is further supported by the noticeable absence of strictly monotonic upward trends, particularly for the recall value, which reflects the correct prediction of the stressed state. Therefore, a larger number of windows should not necessarily be preferred. On the other hand, a smaller number of windows would be more favorable when applying the model in real production settings, as it would require shorter recordings for identifying the state of a given plant.
The presented findings show that the main information for the classification is provided by the features characterizing the temporal component of the signal. Overall, the Hjorth complexity, representing the signal curvature, appears as the most discriminative feature for the classification models. This feature has also previously been identified as the most dominant for the discrimination of both biotic and abiotic stress [9,14]. Nevertheless, its discrimination power is only displayed when it is combined with other features.
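For reference, the Hjorth complexity can be sketched as follows, using the standard definition in which the mobility of a signal is the square root of the ratio between the variance of its first derivative and the variance of the signal, and the complexity is the mobility of the derivative divided by the mobility of the signal. The sketch approximates the derivative by first differences; this discretization and the function names are assumptions, and the exact implementation used in the study may differ.

```python
from math import pi, sin, sqrt
from statistics import pvariance


def first_difference(x):
    """Discrete approximation of the first derivative."""
    return [b - a for a, b in zip(x, x[1:])]


def hjorth_mobility(x):
    """Square root of var(dx) / var(x); x must not be constant."""
    return sqrt(pvariance(first_difference(x)) / pvariance(x))


def hjorth_complexity(x):
    """Mobility of the derivative divided by the mobility of the signal."""
    return hjorth_mobility(first_difference(x)) / hjorth_mobility(x)


# Synthetic signals for illustration only.
n = 1000
pure = [sin(2 * pi * k / 50) for k in range(n)]
mixed = [sin(2 * pi * k / 50) + 0.5 * sin(2 * pi * k / 5) for k in range(n)]

c_pure = hjorth_complexity(pure)    # close to 1 for a single sinusoid
c_mixed = hjorth_complexity(mixed)  # larger once a faster component is added
```

A pure sinusoid yields a complexity close to 1, and the value grows as the waveform departs from a single sine shape, which is the sense in which the feature reflects the shape of the signal.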
However, using all 34 features does not necessarily provide the optimal classification outcome. In fact, even though the model accuracy is higher when combining several groups of features, in some cases the ability to correctly predict both classes becomes unbalanced. On the other hand, when the model includes the main statistical measures, both relatively high accuracy and balanced prediction of the two classes are observed. The minimum, maximum and average of the signal have already been used as information for identifying the plant reaction to environmental stressors [8]. In our case, among the features included in the group of the main statistical measures, the variance appears predominant.
The separation of the plants forming the training set from those in the test set, together with the tuning of the XGB parameters using the custom cross-validation, tends to reduce the risk of overfitting the models and, at the same time, reinforces the objectivity of the evaluation of their performance. Hence, similar results are expected when applying them to new recordings. However, broadening the range of included stimuli could lead to a more robust identification of the presence of general stress in tomato plants.
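A plant-wise separation of this kind can be sketched as below. The custom cross-validation scheme of the study is not detailed in this section, so the following is only one possible arrangement under stated assumptions: folds are formed by holding out whole plants, and the function name `plant_wise_folds`, the number of folds and the toy plant layout are hypothetical.

```python
import random


def plant_wise_folds(plant_ids, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs in which no plant is on both sides."""
    plants = sorted(set(plant_ids))
    random.Random(seed).shuffle(plants)
    for k in range(n_folds):
        # Every n_folds-th plant, starting at k, is held out in fold k.
        held_out = set(plants[k::n_folds])
        val_idx = [i for i, p in enumerate(plant_ids) if p in held_out]
        train_idx = [i for i, p in enumerate(plant_ids) if p not in held_out]
        yield train_idx, val_idx


# Hypothetical layout: 6 plants with 4 samples each.
plant_ids = [plant for plant in range(6) for _ in range(4)]
folds = list(plant_wise_folds(plant_ids, n_folds=3))
```

Because all samples of a plant stay on the same side of each split, the validation score is not inflated by the similarity between samples recorded from the same plant.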
Future work should also extend the analyses to other crops in order to assess the universality of the present findings and, potentially, to conceive a common tool for assessing the health status of plants from different species. An extension including a comparison with molecular and/or biochemical markers would furthermore help to better characterize the impact of different stressors. A fine assessment of the stress intensity could also bring new insights into the plant electrical response that could further improve the classification of the plant state. Data representing early stress should also be investigated for a possible characterization and identification of the stress long before the appearance of visual symptoms, which could further allow more effective and environmentally sustainable protection of the crop.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/app11125640/s1, Table S1: Accuracy of the models built with different combinations of the feature groups.