Towards a Vectorial Approach to Predict Beef Farm Performance

: Accurate livestock management can be achieved by means of predictive models. Critical factors affecting the welfare of intensive beef cattle husbandry systems can be difﬁcult to be detected, and Machine Learning appears as a promising approach to investigate the hundreds of variables and temporal patterns lying in the data. In this article, we explore the use of Genetic Programming (GP) to build a predictive model for the performance of Piemontese beef cattle farms. In particular, we investigate the use of vectorial GP, a recently developed variant of GP, that is particularly suitable to manage data in a vectorial form. The experiments conducted on the data from 2014 to 2018 conﬁrm that vectorial GP can outperform not only the standard version of GP but also a number of state-of-the-art Machine Learning methods, such as k-Nearest Neighbors, Generalized Linear Models, feed-forward Neural Networks, and long- and short-term memory Recurrent Neural Networks, both in terms of accuracy and generalizability. Moreover, the intrinsic ability of GP in performing an automatic feature selection, while generating interpretable predictive models, allows highlighting the main elements inﬂuencing the breeding performance.


Introduction
A large amount of data are nowadays collected in the livestock sector [1][2][3][4]. It is increasingly common to monitor animals, for greater accuracy in the quantity and quality of information, to achieve economic and environmental sustainability of farms. The breeder must generally deal with animals' issues, such as their health conditions and social behavior, affecting the quality of the product, the animals' welfare, and the performance of the farm. The digitization of data collection made it possible to streamline and accelerate the procedures of data collection and processing over time, permitting the registration and consequent elaboration of many additional data, going under the name of Precision Livestock Farming (PLF) [2,5]. The resulting knowledge, processed through mathematical and computational models, may provide the offset of overall incurred costs of the farm, as relevant issues are identified in advance, allowing decisions to be made in time [6][7][8].
The major consequence of a continuous monitoring of animals is a huge amount of data, the so-called "Big Data", contributing to an increase in the complexity among databases. If, on the one side, the PLF approach aims at a greater "accuracy" in the quantity and quality of information, entailing the development of monitoring systems, on the other side, it must deal with the transformation of big data into meaningful information. The increase in the amount of data requires the introduction of proper data management and prediction techniques. Machine Learning (ML) is based on the availability of large amounts of information and on computing power. Rather than making a priori assumptions and following pre-programmed algorithms, ML allows the system to learn from data. Hence, available information left previously unexploited. Indeed, data recorded in the years prior to the target year were not involved in the prediction itself, as the previously considered methods can only handle a point variables, and they were only used to determine the pool of representative farms. In fact, due to their structures, they could only exploit punctual data extracted from one year, targeting the following year. To clarify, it is not impossible to deal with past data. The sequences can be split into different observations in order to maintain the structure of a panel dataset, but the algorithms cannot detect the temporal patterns, as in this case, the observations would be treated as distinct instances [18,19]. Of necessity, the strategy entails the loss of valuable information useful to predict the corresponding target. So far, data from 2017 were used exclusively with targets on 2018. In order to properly tackle the prediction, instead of incorporating the data into a standard panel (see Table 1), in this study, we encapsulate all the values recorded over the years, for each variable, into vectors. Stated otherwise, we introduced the vectorial variables containing data from 2014 to 2017 as input, while targeting the same values in 2018. We opted for this approach since GP was recently developed as Vectorial-Genetic Programming (VE-GP), offering indeed the possibility to exploit vectors as well as scalars and looking promising as its flexibility allows for tackling many different tasks [18,19]. Consequently, we decided to investigate the usefulness of VE-GP among the breeding farms used in [16]. More specifically, we compared the VE-GP approach with Standard-Genetic Programming (ST-GP) and other state-of-the-art ML techniques, including Long Short-Term Memory (LSTM) recurrent neural networks. This study is presented here for the first time.
The article is organized as follows: In Section 2, the application background is discussed, also highlighting the main limits of the prediction methods that have been used so far. The dataset is analyzed, and the basic steps to prepare the benchmark are also described. Afterwards, the ST-and VE-GP approaches, as well as the other studied ML methods, are presented. The obtained results, achieved by all applied ML methods, are provided in Section 3, with particular emphasis on the features selected by the two GP-based methods. The experimental comparisons are discussed in Section 4. Finally, Section 5 concludes the work, also proposing ideas for further developments.

Aim and Scope
The model that is currently used estimates the number of calves born alive produced per cow per year [13][14][15]. It is a classic statistical model, formulated based on zootechnical hypotheses, and it incorporates two variables extracted from the information of the single farm: the average calving interval (intp) and the average calves mortality at birth (m), i.e., perinatal mortality: Calving and mortality detected on the farm at birth are combined through a model that provides the calf quota as a performance measure. However, it is reductive to measure breeding performance by observing only fertility and maternal conditions. As previously exposed in [12,13], gains and losses in farms are not exclusively related to the calving but are often deeply influenced by the calf development after the first 24 h following the birth. The calf, on its side, goes through evolutionary stages that depend on its own condition. The phases immediately following birth, i.e., the intake of colostrum and the healthiness of the environment in which it lives, are of paramount importance. The physiological development process of the animal reaches completion in 60 days after birth. Calf mortality is also an important cause of economic damages in Piemontese cattle farms: For the farmer, it represents the loss of the economic value of the calf and the reduction in both the herd's genetic potential and size. It is straightforward that the gestational phase alone is not exhaustive. The breeding performance should be modeled also considering neonatal mortality, outlining the calf's ability to survive, and the sources of stress such as congenital calf's defects, eventually compromising the immune response and the growth rate, as well as environmental and food conditions, that affect the quality of life of the newborn.

The Dataset
Farms exhibiting continuous visits over a reasonable period, e.g., five years, were acquired. Constant recordings between 2014 and 2019 were then considered [15,16]. As a result, farms whose activity started recently were discarded from the study, as their management still could not be completely defined. Similarly, herds resigned between 2014 and 2019 were excluded to maintain a pool of contemporary farms with comparable data. In brief, the main filters commonly imposed to select herds to work with include the following criteria: cattle farms located in Piedmont with at least 30 cows, a percentage of artificial insemination between 90% and 100% were selected, and updated visits for all the years between 2014 and 2019. Once these farms were selected, it was possible to extract the reports referred to any period in the time window, e.g., 2017-2018, or to use all the five-years information. Finally, the variable used by the ML methods as the target variable was constructed, as it was not directly available in the original dataset. Since the aim is the prediction of the number of weaned calves per cow produced annually, the actual amount was extracted for the years 2018. For each farm over all selected years, the target attribute Y was obtained with the formula below, including the values of the number of the calves born alive, those unable to survive during weaning period, and the number of cows in the corresponding year: Sorting by herd and increasing year, the general dataset has the structure shown in Table 1. The study carried out took shape from the analysis of the summary data from 2017 to build the best predictive model for the number of weaned calves per cow produced in 2018. Setting this goal, it was, therefore, necessary to manage a dataset containing input variables for each farm. Given n instances and m variables, the dataset configuration from 2017-2018 (shown in Table 2) consisted in m input scalar attributes X 17,i , where i = 1, . . . , m for each of the n farms. The number of weaned calves produced per cow in 2018 was obtained with Equation (2), which was named Y 18 in this case.
Since the results by GP did not improve by incorporating more features [15], it was more appropriate to focus on a smaller number of predictors, that can actually be reconducted to the target. As a greater number of features could become a source of noise, some variables that are actually less informative in predicting the target from an a posteriori zootechnical point of view were omitted in this study, as well as variables partially contained into other similar features. For example, in [16], both the total number of calves born and the number of births following natural impregnation were used by most GP models. The number of calves born from natural impregnation is already contained in the total number of newborns. Although it was the most frequently used variable, it may be more appropriate to keep only the total number of newborns by forcing the algorithm to use the latter variable as informative over all the considered farms (natural impregnation is not performed by all the selected herds). Prediction of target can be simpler for the algorithms if the useful information is directly provided, resulting in easier detection. However, ML methods can also find the necessary source of information if it is more complex to extract. Clearly, the task can be easily tackled if some patterns are evident over data. If the information is distributed among other features, the algorithm can detect it anyhow. On the contrary, if no hint is available, the method cannot guess the patterns as if by magic. In Table 2, the final variables are provided for the benchmark.  (2) The variables 1-19 were stored into two datasets: one containing the data referring to 2017-2018 for the standard approach (see Table 3), and the second one containing the data referring to 2014-2017 for the vectorial approach (see Table 4). In both cases, the different partitions intended for training, validation, and testing refer to the same records, sampled equally on both datasets.
. . , m, and j = 1, . . . , n. On the right side, the scalar target variable Y 18 . The division of the dataset into a learning set, subsequently divided into training and validation sets, and a test set was performed. The main idea was to extract enough learning instances in order to perform a k-fold cross validation among it, maintaining at the same time a balanced percentage between learning and test sets (70%-30%). Thereafter, as splitting strategy, 94 records were extracted to form the test set, and the remaining 210 formed the learning set. Among the latter, a 7-fold cross validation was imposed, obtaining 7 pairs of training-validation sets, consisting, respectively, of 180-30 instances. In order to perform enough runs of GP and to compare models, the technique was repeated 10 times by selecting the test instances sequentially from the main dataset, restarting from the beginning each time the last record was reached during the selection phase. The learning instances was randomly shuffled before performing the 7-fold sampling.

Standard vs. Vectorial Approaches: Genetic Programming
GP is a family of population-based Evolutionary Algorithms (EA), mimicking the process of natural evolution [20,21]. GP accomplishes a tree-based representation. The nodes contain operators, whereas the leaves (terminal nodes) are fed with operands, i.e., the features' values. As in an evolutionary biological process, the initial population evolves through the course of generations, exploiting the mechanisms of selection, mutation, and recombination of individuals. For each generation, individuals compete to reproduce offsprings. Individuals may undergo culling or survive to the next generation. As the individuals showing the best survival capabilities have the best chance to reproduce, they form elites of valuable candidates contributing to the creation of new individuals for the next generation. Offsprings are generated by a crossover mechanism, i.e., the recombination of parts of the parents, and by mutation, that is, the alteration of some of the alleles. The survival strength of an individual is measured using a fitness function, a function that computes the goodness of each individual or tentative solution.
To determine how close the prediction models came to represent the desired solution, they are awarded a score generated by evaluating the fitness function computed on the test. Each problem requires its fitness measure, and hence its proper score. When it comes to formulating a problem, defining the objective function can result as one of the most complex parts, as some requirements should be satisfied. The fitness function should be clearly defined, generating intuitive results. The user should be able to intuitively understand how the fitness score is calculated as well. In addition, it should be efficiently implemented, as it could become the bottleneck of the algorithm. When dealing with a regression problem, the choice usually falls onto the Root Mean Square Error (RMSE): where i = 1, . . . , n, and n is the number of instances. The predictor ϕ is evaluated at x i , i.e., the input variables values, and y i is the target values. A good fitness value means a small RMSE, and vice versa. RMSE is expressed in the response variable's unit, and it is an absolute measure of accuracy. The choice of this fitness function is further determined by the application of different ML techniques that build mostly non-linear models. This issue can exclude a discussion based on the coefficient of determination R 2 , as its definition assumes linearly distributed data. When the assumption is violated, R 2 can lead to misleading values [22]. The population is transformed iteratively based on the training set inside the main generational loop of a GP run. Thereafter, sub-steps are iteratively performed within each generation, until the termination criterion is satisfied. At that point, the population is evaluated on the validation set to pick the best model. At every generation, each program in the population is executed and its fitness ascertained on the training set using the proper fitness measure. By selecting, recombining, and mutating the best individuals, at each evolutionary step (i.e., each new generation) the members of the new population are, on average, fitter than the previously generated ones, i.e., they show a smaller error. Among the parameters defining the technique, the preservation of the best individual at each run is feasible, and fitness can be treated as the primary objective, whereas tree size is a secondary parameter, when ranking models. This peculiarity leads to the conservation of the most influential variables over generations. The algorithm performs, hence, an implicit feature selection and, among all the input variables, only the most relevant are encapsulated in the solutions.
ST-GP is a powerful algorithm, suitable to perform symbolic regression on any dataset. However, as many other standard techniques do, instances are treated independently, showing a potential disadvantage when dealing with sequential data. This may result in a loss of knowledge in pattern recognition of the temporal information. In addition to Recurrent Neural Networks (RNN), whose structure is suitable for managing a collection of observations at different equally spaced time intervals, Vectorial Genetic Programming (VE-GP) can manage vectorial variables representing time series [18,19,[23][24][25]. Indeed, the development of the ST-GP led to techniques exploiting terminals in the form of a vector. With this representation, all the past information associated to an entity is aggregated into a vector, giving a sense of memory and helping to keep track of what happened earlier in the sequential data. VE-GP comes with enhanced characteristics of ST-GP exploiting a proper data representation processed with suitable operators to handle vectors, reinforcing the identification ability of correlations and patterns. The target can be scalar, as well as vectorial. The technique can indeed treat both vectors, even of different lengths, and scalars together, performing both vectorial and element-wise operations.

Standard vs. Vectorial Approaches: Experimental Settings
ST-GP and other classic ML approaches were performed using the GPLab package built in MATLAB and the R library caret [26][27][28]. Correspondingly, in addition to GP, k-Nearest Neighbors (kNN), Neural Networks (NN), and Generalized Linear Models with Elastic NET regularization (GLMNET) were also tuned, based on the average performance over the validation sets. Concerning the vectorial approach, VE-GP was performed with the recent version of GPLab, introduced to handle vectorial variables [18], whereas the LSTM's comparative results were obtained with the available deep learning toolbox, implemented in MATLAB. Clearly, results were compared in terms of RMSE (3) as an error measure.
Characterized by a very simple implementation and low computational cost, the kNN algorithm is known as "lazy learning", as it does not build a model, but it is an instancebased method, exploited for both classification and regression tasks. The input consists of the k closest instances (i.e., neighbors) in the features space, and the corresponding output is the most frequent label (classification) or the mean of the output values (regression) of k nearest neighbors. Otherwise stated, in the latter case, the k nearest points are computed to predict the value of any new data point, and the values of their output is averaged to be assigned as the prediction to the given point. The number of k nearest neighbors should be chosen properly, since the predictive power can be strongly affected afterwards. A small value of k leads to overfitting, and results can be highly influenced by noise. On the contrary, a large value results in very biased models and can be computationally expensive.
A NN, usually denoted with the term of Artificial Neural Network (ANN), emulates the complex functions of the brain. An ANN is a simplified model of the structure of a biological neural network and consists of interconnected processing units organized according to a specific topology. The network is fed with features values through an input layer. Thereafter, the learning takes place among one or more hidden layer, composing the internal network. Finally, the network includes an output layer, where the prediction is given. Learning occurs by changing connections weights based on the error affecting the output. At each update, the weights of the connection between nodes are multiplied by a factor in order to prevent the weights from growing too large and the model from getting too complex.
Concerning LM, a GLMNET was preferred over standard LM. The algorithm fits generalized linear models by means of penalized maximum likelihood, combining the Lasso and Ridge regularizations, using the cyclical coordinate descent. These techniques allow one to accommodate correlation among the predictors by penalizing less informative variables: Ridge penalty shrinks the coefficients of correlated predictors towards each other, while Lasso tends to pick the most informative ones and discard the others. Compared to standard linear regression, more accurate results are usually expected from its application, as it combines feature elimination from Lasso and feature coefficient reduction from Ridge. The elastic-net penalty is controlled by the parameter α: α = 0 is pure Ridge, whereas α = 1 is pure Lasso. The overall strength of the penalty for both Ridge and Lasso is controlled by the parameter λ: The coefficients are not regularized if λ = 0. As λ increases, variables are shrunk towards zero, and they are discarded by Lasso regularization, whereas Ridge regularization includes all the variables.
One of the disadvantages of an ANN is that it cannot capture sequential information in the input data. An ANN can deal with fixed-size input data, that is, all the item features feed the network at the same time, such that there is no time interval between the data features. When dealing with sequential data, in which there are strong dependencies between the data features, i.e., in text or speech signals, a basic ANN is not able to properly address the task. In this regard, basic ANNs were developed to make way for a more efficient algorithm, particularly useful for time series. RNN is a type of ANN that has a recurring connection to itself. The gap between information may become very large, and the amount of sequential information can be complex to retain. As that gap grows, RNNs lose their ability to learn connections. To overcome the short-term memory weakness, (LSTM) architecture was designed to solve this problem with RNNs. By means of internal mechanisms, they keep track of the dependencies between the input sequences, storing and removing unnecessary information. The LSTM introduces the concept of cell states. By using special neurons called "gates" placed in the cell state, LSTMs can remember or forget information. Three kinds of gates are available inside the cell, in order to filter information from previous inputs (forget gate), to decide what new information to remember (input gate), and to decide which part of the cell state to output (output gate). These gates are a sort of highway for the gradient to flow backwards through time.
In Tables 5 and 6, the final optimal parameters are summarized. Regarding ST-GP, we provided the algorithm with a set of primitives F composed of {plus; minus; times; mydivide}, where plus, minus, and times indicate the usual operators of binary addition, subtraction, and multiplication, respectively, while mydivide represents the protected division, which returns the numerator when the denominator is equal to zero. Likewise, we chose proper functions for VE-GP. Differently from ST-GP, suitable functions are indeed provided to manage scalar and vectors [18,19]. For the considered problem, we used {VSUMW; V_W; VprW; VdivW; V_mean; V_min; V_meanpq; V_minpq}. The first four operators represent the elementwise sum, difference, product, and the protected division between two vectors or between a scalar and a vector, respectively, e.g., VSUMW([2,3.5,4,1],[1,0,1,2.5] = [3,3.5,5,3.5]). The mean and minimum of a vector return the corresponding value for the whole vector (standard aggregate functions V_mean and V_min) or for a selected range [p, q] inside the vector, where p and q are positive integers with 0 < p ≤ q (parametric aggregate functions V_meanpq and V_minpq), e.g., V_mean([2,3.5,4,1]) = 2.6, whereas V_mean 3,4 ([2,3.5,4,1]) = 2.5. The fact that standard and parametric aggregate functions collapse the vectorial variable into a single value allows one to handle all the information contained in the vector or part of it. In addition to crossover and mutation, the algorithm is provided with an operator reserved for the mutation of the aggregate function parameters. It allows p and q to evolve in order to detect the most informative window in which to apply thereafter the aggregate function. The set of terminals was composed of the predictors in Table 2 for both ST-and VE-GP.

ST-GP vs. VE-GP
ST-GP and VE-GP were first compared, in order to analyze the behavior of the two algorithms. In Figure 1, the median fitness evolution is plotted, based on the following procedure. For each fold within the learning set, a model was selected according to its performance over the validation set. Hence, after seven runs of the GP, seven models were available, i.e., the ones showing the lowest fitness among the validation. All seven best drawn models were evaluated on the whole learning set and the test set, and the median of the seven models was stored. As the 7-fold was repeated 10 times, 10 median trends were available at the end of the entire evolutionary process. The plot shows the median behavior of the 10 median fitness achieved for each generation. We initially decided to run the two algorithms for 100 generations. The choice of stopping the evolution after 40 generations was dictated by the overfitting trend recorded among ST-GP. On the contrary, VE-GP proved to be more stable than ST-GP, at least as far as we ran 100 generations. Moreover, the median fitness was overall lower, showing that GP is affected by a remarkable improvement of such a problem, if temporal information is added, together with proper functions. The VE-GP models outperformed the ST-GP ones, stabilizing at lower errors. We analyzed the predictors encapsulated in the final models by both ST-and VE-GP, selected with respect to the performance achieved on the test sets by running the algorithms for 40 generations. Table 7 shows that both methods used nine variables to tackle the target. However, the same predictors were not used and, above all, not with the same frequency. The number of COWS, for example, was highly exploited by both GP algorithms, but all the VE-GP models based the prediction on this feature, whereas only 70% of the ST-GP models found it to be informative. C PAR , on the other hand, was used only by the ST-GP and in 50% of the solutions, and N BALIVE was involved in 60% of them. N TOT was rather exploited only by the VE-GP and in 80% of the models. It is evident that as long as GP is run to predict the target based on the information of a single year, patterns are more difficult to be found, and the algorithm (ST-GP) tries to solve the problem by extracting as much information as possible from as many features as possible (7 variables out of 18 were used in more than 20% and at most in 70% of the solutions). When providing temporal information, the search was easier for GP, whose models achieved better fitness, detecting mainly the information based on a few predictors (4 out of 18 were exploited in more than the 30% of solutions, and among the four features, 1 was handled by all the models). Even considering the variables used by each model (Table 8), on average, 8.4 predictors were used by the ST-GP (from 6 to 15), whereas the VE-GP built models exploiting 5.5 features on average (from 3 to 9). Table 7. Frequency of use of each variable among the best 10 individuals found by ST-GP (left column) and VE-GP (right column).

Comparisons of ST-GP and VE-GP with Other ML Methods
Now, we discuss the results of the experimental comparison between the ST-GP, VE-GP, and the other ML methods presented previously. As already explained, in addition to ST-GP, KNN, NN, and GLMNET also exploited the information on 2017 with a target in 2018, whereas LSTM was involved in the VE-GP to process vectorial variables for 2014-2017 and the 2018 target. The results reported in Section 3.1 for the ST-GP compared to the VE-GP are also supported by the corresponding boxplots in Figure 2. The median and mean RMSE values are reported in Table 9.  The Kruskal-Wallis nonparametric test, performed for all the considered methods with a significance level of α = 0.05, was applied to investigate the RMSE achieved among the learning sets and the test sets separately. The resulting p-values (t 0.001) showed extremely significant differences in median performance between the methods, considering both stages. The pairwise Wilcoxon tests provided with Bonferroni correction α = 0.05/15 = 0.0033 was hence performed among all compared techniques. Among the learning set, STGP was significantly different from all other methods, resulting in a poor performance. Likewise, KNN was significantly different with respect to both NN and LSTM, as well as the comparison between VEGP and LSTM. Concerning the RMSE achieved on the test sets, STGP performed poorly with respect to other methods, showing greater, significantly different values for the RMSE on average. On the contrary, GLMNET's performance was significantly better than KNN's and NN's, as well as LSTM's compared to KNN and VEGP, respectively. Consequently, the following pairs of methods did not show significantly different performance: VEGP-KNN, VEGP-NN, VEGP-GLMNET, VEGP-LSTM, as well as the pair LSTM and GLMNET.
As a further study, we also compared learning and test fitness distributions obtained by the single methods in order to determine the occurrence of overfitting. The Wilcoxon signed rank test showed that the two distributions for KNN and the two obtained with NN were different, since the corresponding p-values were extremely significant (Wilcoxon: p 0.001), as well as the median RMSE (Kruskal-Wallis test: p 0.001). Concerning the ST-GP, the two distributions and the median error were slightly different (Wilcoxon and Kruskal-Wallis: p-values equal to 0.048 and 0.034, respectively). GLMNET showed the same learning and test fitness distributions but different median RMSE (Wilcoxon: p > 0.05; Kruskal-Wallis: p = 0.041), whereas LSTM achieved different distributions with similar median. VE-GP was the only method that produced the same fitness distributions with the same median among the learning and test sets.

Discussion
Considering all the results of the statistical tests, ST-GP produced less accurate models, and all the other methods outperformed ST-GP. However, among the different techniques, KNN and NN clearly overfitted, generating good results on training data but losing their ability to generalize on unseen data. On the contrary, VE-GP, GLMNET, and LSTM produced better and statistically similar results, as the RMSEs on the test set were not significantly different across these methods. In particular, LSTM produced the best fitness considering both learning and test sets. However, VE-GP was the only method showing the same distribution among learning and test sets, highlighting its ability to generalize better over unseen data. These outcomes are a clear confirmation of the importance of introducing the temporal information in the form of vectors for the studied problem.
The study was dedicated to the inspection of GP behavior when predicting a target starting from datasets that, in one case, were exclusively formed by scalar values (treated hence with ST-GP) and, in the other, assumed a vector representation (handled with VE-GP). This representation is quite useful for incorporating temporal patterns or, in general, successive collections of data for single variables among the same candidate. Indeed, with the common representation through standard data frames, such patterns are usually not recognizable, and the performance of the models do not improve. On the contrary, if the data are organized into a vectorial dataset, the algorithm receives temporal information in input. Thereby, by means of proper functions able to manage vectors, it can produce more accurate predictions. First, the dataset was prepared to deal with the vector-based representation. The datasets, sharing the same scalar target from 2018 (i.e., the quota of weaned calves per cow per year) were prepared by extracting the data among 2017 and among the whole period from 2014-2017, based on a previously defined set of farms. In this study, a different splitting rule was defined among the datasets with respect to previous investigations. The learning and test sets were selected respecting the proportion of 70%-30%, and thereafter, learning sets, randomly reshuffled, were split according to a 7-fold cross validation technique. Prediction models were constructed with different GP algorithms, ST-and VE-GP first, that were thereafter compared with other ML methods. VE-GP was compared with LSTM, which considers the time relationship among the data.
The main goal was hence to inspect the ability of VE-GP with respect to ST-GP in predicting the target. The developed algorithm could produce better results by achieving lower RMSEs among both learning and test sets. We first analyzed the evolution of the median fitness observed on the learning and test sets, and clearly, VE-GP proved to be more stable, evolving a population through more generations without giving sign of overfitting, whereas ST-GP showed the "symptom" quite soon, considering similar experimental settings. In addition, VE-GP reached better results by encapsulating fewer variables in each extracted candidate model, detecting the information to a greater extent from specific features. VE-GP still gave access to the formula and to the features implicitly selected, providing meaningful information about the tackled issue. Being able to extract important features among the predictors in form of vectors, the algorithm improved the target forecast. VE-GP turned out to outperform not only ST-GP but also other techniques used in the field of ML. Although VE-GP performed similarly to LSTM and GLMNET (the latter exploiting the standard data representation), it was the only method that did not show a significantly different behavior on the learning and test sets. The two distributions and their median are similar, entailing that VE-GP provides a good response in terms of generalization ability on unseen data. Improvements can be expected by feeding the algorithm with larger datasets by providing more candidates and longer vectors.

Conclusions
Exploring the vectorial approach required, as already stated, a different input data structure. To this purpose, the farms considered in the pool of instances were the same as in [16]. However, since the results showed that GP exploited only certain variables, the number of predictors was reduced to 18. In this way, possible noise due to extra variables, which were not very informative, was avoided. The main goal was to inspect the ability of Vectorial Genetic Programming (VE-GP) with respect to ST-GP to predict the target. The recently developed VE-GP algorithm could produce better results, by achieving a better fitness on both the learning and test sets. VE-GP proved to be more stable, evolving a population through more generations without showing overfitting, while Standard Genetic Programming (ST-GP), was affected by overfitting already in the early generations under analogous experimental settings. VE-GP still favored model investigation by giving access to the formula and hence to the features implicitly selected, providing meaningful information about the tackled issue. Better results were obtained by encapsulating fewer variables in each extracted candidate model, detecting almost all the information among specific features. The algorithm improved the target forecast, proving to outperform not only ST-GP, but also other techniques used in the field of ML. The algorithm, in particular, was compared to Long Short-Term Memory Recurrent Neural Network (LSTM), suitable for handling vectorial predictors. Even though performing similar to the LSTM and Generalized Linear Models (GLMNET), the latter exploiting the standard data panel representation, VE-GP was the only method entailing a greater ability in generalization over unseen data.
The introduction of vectorial variables produced a significant improvement over the accuracy of the result. Evolution was also much more stable, and the ability of the algorithm to handle any type of variable, both scalar and vectorial, makes it quite a flexible tool. These considerations open the possibility of providing more complex datasets, containing different types of sequential features. The possibility of managing vectorial variables, whose values can be of different types and have no fixed length among the whole dataset, push the analysis beyond the basic research conducted at this point. On the one side, both categorical and continuous variables can be treated simultaneously, without specifying it explicitly to the algorithm: The latter is indeed able to process them during the evolution without hints given by the user. On the other hand, when dealing with vectors, some data may be not available. Thus, the vector variables may have different lengths, even being scalars when the latter is equal to one. VE-GP is suitable to manage all kinds of features dynamically, combining them in the prediction of the target. Evolutionary algorithms can be applied to zootechnical data, achieving performing models able to learn from all the available data. In this case study, the breeding variables in the report extracted at the end of the year were used. In one case, they were managed for only one year; in the other four, the average values, corresponding to four years, were used, proving to be more suitable for reducing the prediction and generalization errors. Instead of limiting the analysis to the year-end average, it might be more useful to incorporate the data collected from each farm visit into a vector representation. As a result, all variations, even small ones, would be available, and the algorithm could identify temporal patterns that were not visible by directly processing the average value for the whole year.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: