Domain Adaptation and Federated Learning for Ultrasonic Monitoring of Beer Fermentation

Beer fermentation processes are traditionally monitored through sampling and off-line wort density measurements. In-line and on-line sensors would provide real-time data on the fermentation progress whilst minimising human involvement, enabling identification of lagging fermentations or prediction of ethanol production end points. Ultrasonic sensors have previously been used for in-line and on-line fermentation monitoring and are increasingly being combined with machine learning models to interpret the sensor measurements. However, fermentation processes typically last many days and so impose a significant time investment to collect data from a sufficient number of batches for machine learning model training. This expenditure of effort must be multiplied if different fermentation processes must be monitored, such as varying formulations in craft breweries. In this work, three methodologies are evaluated to use previously collected ultrasonic sensor data from laboratory scale fermentations to improve machine learning model accuracy on an industrial scale fermentation process. These methodologies include training models on both domains simultaneously, training models in a federated learning strategy to preserve data privacy, and fine-tuning the best performing models on the industrial scale data. All methodologies provided increased prediction accuracy compared with training based solely on the industrial fermentation data. The federated learning methodology performed best, achieving higher accuracy for 14 out of 16 machine learning tasks compared with the base case model.


Introduction
Beer is one of the world's oldest and most widely consumed alcoholic beverages [1]. Beer fermentation processes are conventionally monitored through sampling and off-line wort density measurements [2]. This method is typically performed every couple of hours, requires manual operation, is time-consuming, and does not produce real-time results [3]. Automatic acquisition of real-time data pertaining to the fermenting wort would enable accurate process end point determination and identification of lagging fermentations. This would provide benefits of improved product consistency, fewer lost batches, time savings, and environmental benefits of less waste and less resource and energy use [3]. This can be achieved through in-line and on-line sensing techniques, where in-line methods directly measure properties of the fermenting wort and on-line methods use bypasses to automatically collect, analyse, and return samples to the vessel [4]. Furthermore, manufacturing is undergoing the fourth industrial revolution, where industrial digital technologies such as the Internet of Things (IoT), cloud computing and Machine Learning (ML) are implemented to integrate not only entire processes but also markets and supply chains [5]. This has the potential to increase the efficiency, productivity, product quality, and flexibility of manufacturing processes [5]. In-line and on-line sensors underpin this transformation by collecting the real-time data to provide automatic decision-making and minimise human involvement [6]. Several in-line and on-line methods to monitor alcoholic fermentation have been investigated, such as near-infrared spectroscopy [3,7], Raman spectroscopy [8,9], mid-infrared spectroscopy [10], Fourier transform infrared spectroscopy [11], MEMS resonators [12], CO2 emission monitoring [13], and ultrasonic (US) sensors [14–18].
Typically, these techniques rely on calibration procedures to correlate sensor data to material composition across the full range of process conditions (e.g., temperature) [3]. Conversely, ML can be used to map sensor data directly to target variables (such as classifying the stage of the fermentation process or predicting the time remaining until significant process milestones) without requiring extensive calibration procedures. Moreover, ML is able to fit complex non-linear relationships between multiple variables, or features, extracted from sensor readings. Furthermore, validation procedures encourage the development of models which predict accurately when process parameters are outside of the range they were trained on. Ultrasonic sensors are low-cost, non-invasive, small in size, have low energy consumption, and are able to characterise opaque materials. ML has previously been combined with US sensors to monitor fermentation processes. Hussein et al. (2012) used the US velocity, process temperature, and nine signal features extracted from the time and frequency domains to predict wort density using an artificial neural network [14]. In [18], time domain signal features were inputted into Long Short-Term Memory (LSTM) neural networks to predict the alcohol by volume percentage throughout fermentation.
ML methods require sufficient volumes of data for model training. However, fermentation processes can last for many days, imposing a significant time investment for data collection. Therefore, industrial fermentation monitoring using sensors and ML would benefit from using knowledge gained from previously monitored fermentation processes whether conducted in a laboratory or from other breweries. This would be of particular benefit to the growing craft breweries industry, where a wider range of beers are produced at smaller volumes, necessitating ML models which can be trained on fewer fermentation batches whilst being robust across different formulations of beer [19,20]. However, US sensor readings acquired from different fermentation vessels (different domains) present different data distributions to the ML models [21]. This can be due to differing US sensor contact between the two vessels, a difference in vessel construction affecting US waveform propagation, or differing waveform frequency distributions produced by the sensors [21]. Therefore, even for a similar fermentation task, the ML model trained on the source domain data will perform poorly when asked to make a prediction based on the target domain data. Domain adaptation is a subcategory of transfer learning which alters how the ML model is trained to predict accurately across both domains [22]. Unlabeled domain adaptation techniques can be used for tasks with no reference measurement available in the target domain to correlate input features to output variables during ML model training [21]. Conversely, labelled domain adaptation can be used for tasks where a reference measurement is obtainable. 
Common unlabeled domain adaptation techniques include minimising the distance between features from different domains using metrics such as the Maximum Mean Discrepancy [21,23–27], adversarial methods to confuse domain membership classifiers [28–32], generative methods to transform domain features [33–36], or Adaptive Batch Normalisation, which aligns the feature distributions across the domains for each batch [37,38]. Labelled domain adaptation can be achieved by pre-training on the source domain and fine-tuning on the target domain, by retraining the last few layers of a network using the target domain data, or by training using the data from both domains simultaneously [39]. When training ML models across fermentation processes from multiple breweries, the companies may not wish to share US sensor data which could reveal information about their product formulation or process control strategies. In this case, federated learning may be used: network weights from local models, each trained on an individual brewery's data, are shared to update a common global model. The acquired sensor data is never transferred, and so privacy is maintained [40].
In this work, US sensor data acquired from a laboratory fermentation process is used to aid ML prediction on an industrial scale fermentation task. The industrial scale fermentations were monitored at a Small and Medium-sized Enterprise (SME) company, and so the data is of limited volume. Therefore, the laboratory scale dataset is used to improve ML model accuracy on this limited number of batches. The models are trained as multi-task networks to predict four outputs: classification of whether ethanol production has started, classification of whether ethanol production has ended, the time remaining until ethanol production begins, and the time remaining until ethanol production ends. Rather than using US sensor data to predict the wort density or alcohol by volume, this methodology directly predicts the most important information required from the fermentation process: whether the fermentation is lagging and when the fermentation end point is reached.
Three domain adaptation methodologies are investigated. Firstly, labelled domain adaptation is used to simultaneously train the models on data from both domains. Simultaneous training is used in preference to pre-training on the laboratory scale data and fine-tuning on the industrial scale data, or retraining only the last few layers of the network, which are the approaches usually applied to convolutional networks in transfer learning for image recognition tasks. This is because, unlike convolutional filters which can detect features compared to a background of neighbouring pixels, the differences in feature magnitudes and trajectories in this work mean that features extracted in the source domain would not transfer to the target domain and the network would undergo catastrophic forgetting [41]. Secondly, the networks are trained in a federated learning strategy to evaluate the impact of privacy preservation on ML model accuracy. Lastly, fine-tuning on the target domain of the best performing models, which have been trained on the source and target domains simultaneously, is investigated.

Materials and Methods
Two sets of fermentations were monitored: one in a 30 L laboratory scale vessel at the University of Nottingham and the second in a 2000 L industrial scale fermenter at the Totally Brewed brewery in Nottingham, UK. Full experimental details for the laboratory scale fermentations are included in [18]. The laboratory scale dataset consisted of 13 fermentations and the industrial scale dataset consisted of 5 fermentations. For the laboratory scale dataset, the same type and quantity of malt (Coopers Real Ale, Adelaide, Australia), yeast (Coopers Real Ale, Adelaide, Australia), sugar (brewing sugar, the Home Brew Shop, Farnborough, UK) and water (22 L) were used for all fermentations. For the industrial scale dataset, three different beers were monitored: three fermentations of Slap in the Face, one of Guardian of the Forest, and one of 4 Hopmen of the Apocalypse. The same US probe was used to monitor both the laboratory and industrial scale fermentation processes (Figure 1). The US probe contained a US transducer (Sonatest, 2 MHz central frequency, Milton Keynes, UK) and a temperature sensor (RTD, PT1000, RS Components, Corby, UK). The US transducer was connected to a Lecoeur Electronique US Box (Chuelles, France) that provided the excitation pulse to the transducer and digitised the received US signal. The temperature sensor was connected to a Pico electronic box (PT-104 Data Logger, Pico Technology, St Neots, UK). The two electronic boxes were connected to a laptop that controlled the data acquisition. Coupling gel was applied between the US transducer and the probe material, and a spring maintained the contact pressure. For the laboratory scale fermentations, a Tilt hydrometer provided real-time density measurements as a reference measurement of the fermentation progress and to provide labelled data for ML model training.
For the industrial scale fermentations, samples were removed every two hours (except during night-time) and the wort density was measured using a hydrometer. For the industrial scale fermentations only, the temperature was decreased once the desired wort density was reached. Blocks of US and temperature data were collected periodically. Each block consisted of 36 US waveforms and 36 temperature readings. Each US signal consisted of 7000 sampling points at 80 MHz sampling frequency. The time between each waveform acquisition was 0.55 s, and 200 s elapsed between blocks. As depicted in Figure 1, the US transducer emitted sound waves which travelled along the PMMA probe material. At the interface between the probe material and the wort, a portion of the sound wave was reflected and the rest continued through the fermenting wort. Part of the reflected sound wave travelled through the probe-couplant boundary and was received by the transducer (the first reflection) whilst some reflected from this interface and repeated the previously described path (the second reflection). Therefore, the second reflection is a reverberation of the first reflection's path. The portion that passed through the fermenting wort was reflected at the opposite probe wall and travelled back to the transducer (the third reflection). An example of the US waveform recorded by the transducer is presented in Figure 2a. Each reflection is presented in isolation in Figure 2b-d. The start of the waveform (sample points <1000 in Figure 2a) was reflected back to the transducer before it contacted the probe-wort interface and therefore contains no useful information about the fermentation.

Ultrasonic Waveform Features
In total, 14 US waveform features were inputted into the ML models. Explanation of the calculation method and justification of the feature choices are provided in the following sections. In addition to the US waveform features, the process temperature was also used as an input. Although US sensors can accurately monitor fermentations without inclusion of the temperature as a feature [18], temperature sensors are already installed on most industrial vessels. As such, this data can be exploited in the ML models with no further effort in sensor installation or data collection.

Energy
The waveform energy is a measure of the total magnitude of the sound wave received by the transducer during an enveloped period. For the first reflection, this is a measure of the proportion of the sound wave reflected from the probe-wort interface and provides a measure of the changing wort density. Similarly, the energy of the second reflection is also dependent on the density of the fermenting wort in contact with the probe material. The energy of the third reflection is dependent on the previously discussed probe-wort boundary, the far wort-probe boundary, sound wave attenuation in the wort through which it travels, and the level of sound wave attenuation caused by CO2 bubbles present in the wort [42].

$$E = \sum_{i=\mathrm{start}}^{\mathrm{end}} A_i^2 \qquad (1)$$

where E is the waveform energy, A_i is the waveform amplitude at sample point i, and start and end denote the range of sample points for the reflection of interest [43]. The waveform energy was the only feature selected from the oscillating part of the US waveform. Other features are commonly extracted to be used as ML model inputs, e.g., the peak-to-peak amplitude, maximum amplitude, minimum amplitude, skewness, kurtosis, and standard deviation [18,21]. However, previous work performing domain adaptation with US waveforms has shown that these additional features are unlikely to follow the same trend in both domains and their inclusion will degrade ML accuracy [21]. Therefore, only the waveform energy is used in this work as it is a measure of physical changes in the monitored wort.
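As a minimal sketch of the energy feature, assuming the energy is the sum of squared amplitudes over the reflection window (consistent with the symbols defined above), the calculation can be written as follows. The function name, the inclusive window convention, and the toy waveform are illustrative assumptions, not the authors' implementation.

```python
def waveform_energy(amplitudes, start, end):
    """Sum of squared amplitudes over sample points start..end (inclusive).

    Assumption: the window bounds are inclusive sample indices.
    """
    return sum(a * a for a in amplitudes[start:end + 1])


# Toy waveform for illustration; the waveforms in the text have 7000 points.
waveform = [0.0, 0.5, -1.0, 2.0, -0.5, 0.0]
energy = waveform_energy(waveform, start=1, end=4)
```

In practice the window would be chosen separately for each of the three reflections, so one waveform yields three energy features.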

Energy Standard Deviation
The standard deviation in the waveform energy was calculated across the 36 US waveforms obtained during each acquisition block. As CO2 bubbles may be present in the wort through which the third reflection travels, or on the probe surface affecting the first and second reflections, the energy standard deviation monitors CO2 formation throughout fermentation.

$$\mathrm{STD} = \sqrt{\frac{1}{W}\sum_{i=1}^{W}\left(E_i - \bar{E}\right)^2} \qquad (2)$$

where STD is the standard deviation, W is the number of waveforms collected in the block, E_i is the energy of an individual waveform i, and \bar{E} is the mean waveform energy in the block.
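The block-level feature can be sketched as below. This assumes the population form of the standard deviation (division by W); the source does not state whether W or W - 1 is used, so that choice and the toy energies are assumptions.

```python
import statistics


def energy_standard_deviation(block_energies):
    """Population standard deviation of the waveform energies in one block.

    Assumption: division by W (population form), not W - 1.
    """
    return statistics.pstdev(block_energies)


# Toy block of four energies; a block in the text holds 36 waveforms.
block_energies = [4.0, 5.0, 6.0, 5.0]
energy_std = energy_standard_deviation(block_energies)
```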

Time of Flight
The time of flight was calculated using three different methods to overcome the noise and low amplitude signals present in the acquired US waveforms. Firstly, a thresholding method identified the earliest waveform sample point that rises above a predetermined value, and was calculated for all three reflections. Secondly, a zero-crossing method identified the sample point where the waveform crosses zero after the threshold value had been reached, and this was also calculated for all three reflections. Finally, an auto-correlation method identified the sample point at which the correlation between the first reflection and each subsequent reflection was highest. The time of flight is a measure of the speed of sound through the materials, i.e., the probe material for the first and second reflections (dependent on the temperature of the material) and the wort for the third reflection (dependent on wort temperature and density) [44].
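The three time of flight estimates described above can be sketched on a toy signal as follows. The function names, the positive-to-non-positive definition of a zero crossing, and the use of a sliding dot product as the correlation measure are assumptions made for illustration; only the 80 MHz sampling frequency is taken from the text.

```python
SAMPLING_FREQUENCY_HZ = 80e6  # from the acquisition settings in the text


def threshold_index(waveform, threshold):
    """Earliest sample index whose amplitude rises above the threshold."""
    for i, amplitude in enumerate(waveform):
        if amplitude > threshold:
            return i
    return None


def zero_crossing_index(waveform, threshold):
    """First index at or below zero after the threshold has been exceeded."""
    start = threshold_index(waveform, threshold)
    if start is None:
        return None
    for i in range(start, len(waveform) - 1):
        if waveform[i] > 0 >= waveform[i + 1]:
            return i + 1
    return None


def best_correlation_lag(reference, signal):
    """Lag at which the sliding dot product of reference and signal is largest."""
    n = len(reference)
    best_lag, best_score = 0, float("-inf")
    for lag in range(len(signal) - n + 1):
        score = sum(r * s for r, s in zip(reference, signal[lag:lag + n]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def samples_to_seconds(index):
    """Convert a sample index to a time in seconds."""
    return index / SAMPLING_FREQUENCY_HZ


toy_waveform = [0.0, 0.1, 0.6, 1.0, 0.4, -0.3, -0.8, 0.0]
t_threshold = threshold_index(toy_waveform, threshold=0.5)
t_zero = zero_crossing_index(toy_waveform, threshold=0.5)
lag = best_correlation_lag([1.0, -1.0], [0.0, 0.0, 0.0, 0.5, -0.5, 0.0])
```

Using three estimators gives redundancy: a noisy waveform that defeats the threshold method may still give a usable correlation lag.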

Machine Learning
Multi-task deep neural networks consisting of a fully connected layer followed by an LSTM layer were used for all ML tasks. A summary of the three domain adaptation methods used is provided in Table 1. The fully connected layer enabled the creation of new features that are similar across both domains from combinations of the original inputs. The LSTM layer learns the trajectories of these modified features. The multi-task models were trained to simultaneously predict whether the production of ethanol had begun (classification), whether the production of ethanol had ended (classification), the time remaining until the start of ethanol production (regression), and the time remaining until ethanol production finishes (regression). In an industrial environment, this would provide benefits of identifying lagging fermentations by monitoring the start of ethanol production and estimating process end times by monitoring when ethanol production was complete. Multi-task learning is advantageous as it can allow for more effective process learning in the ML model when multiple metrics are desired whilst reducing the redundant information being stored [45]. Furthermore, multi-task learning is likely to reduce overfitting by preventing a single task from dominating the learning process.
LSTM layers in neural networks are able to retain information from previous timesteps in a sequence. LSTMs are a type of recurrent neural network that reduces the likelihood of vanishing or exploding gradients by using gate units. This enables their use over much longer sequences [46]. Zero-padding was applied to the US features to make every fermentation sequence equal to the maximum sequence length of 1556 timesteps. A masking layer designated that the LSTM units ignore this padding. All timesteps for each fermentation were used as a single sequence rather than being truncated into multiple sequences of shorter length. While long sequences (250-500 timesteps) are prone to producing vanishing gradients in LSTM layers when predicting a single output, this is not a concern when predicting an output at every timestep, as used in this work [47]. The input features from each dataset were independently normalised so that every feature ranged between 0 and 1 for both domains. This step aids domain adaptation capability by aligning the feature distributions from both domains, and is similar to the methodology used in [21].
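The per-domain normalisation and zero-padding steps can be sketched in plain Python as below. Padding at the end of each sequence and returning an explicit boolean mask are assumptions; the text only states that padding was applied up to the maximum length (1556 timesteps) and that a masking layer made the LSTM units ignore it.

```python
def min_max_normalise(sequences):
    """Scale every feature to [0, 1] using the min and max over one domain.

    sequences: list of fermentations, each a list of timesteps,
    each timestep a list of feature values.
    """
    n_features = len(sequences[0][0])
    lows = [min(step[f] for seq in sequences for step in seq)
            for f in range(n_features)]
    highs = [max(step[f] for seq in sequences for step in seq)
             for f in range(n_features)]
    normalised = []
    for seq in sequences:
        normalised.append([
            [(step[f] - lows[f]) / (highs[f] - lows[f])
             if highs[f] > lows[f] else 0.0
             for f in range(n_features)]
            for step in seq
        ])
    return normalised


def zero_pad(sequence, max_length):
    """Append all-zero timesteps and return (padded sequence, mask).

    Assumption: padding is added after the real data. The mask is True
    for real timesteps and False for padding.
    """
    n_features = len(sequence[0])
    padding = [[0.0] * n_features for _ in range(max_length - len(sequence))]
    mask = [True] * len(sequence) + [False] * len(padding)
    return sequence + padding, mask


# Two toy laboratory fermentations with two features each.
laboratory = [[[1.0, 10.0], [3.0, 20.0]], [[2.0, 30.0]]]
laboratory_norm = min_max_normalise(laboratory)
padded, mask = zero_pad(laboratory_norm[1], max_length=2)
```

Each domain would be passed through `min_max_normalise` separately, so that both laboratory and industrial features span the same 0 to 1 range. An explicit mask is kept because a normalised real value can itself be 0.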
A k-fold cross-validation procedure determined the optimal batch size, number of neurons in the fully connected layer, number of LSTM units, learning rate, L2 regularisation penalty, and number of epochs. As five industrial fermentation batches were monitored, the number of these fermentations used in the training set ranged from one to four, corresponding with the number of fermentations in the test set ranging from four to one (Table 2). Therefore, k was determined by the number of industrial fermentations present in the training set. For example, if only one fermentation was used in the training set, no cross-validation could be performed. However, when four fermentations were used, fourfold cross-validation was performed ( Table 2).
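One way to read this fold scheme is that each fold holds out a single industrial fermentation for validation and trains on the rest. The sketch below follows that reading; the hold-one-out interpretation, the function name, and the batch labels are assumptions, as the text only states that k equals the number of industrial fermentations in the training set.

```python
def industrial_folds(industrial_batches):
    """Return (training batches, validation batch) pairs, one per fold.

    Assumption: each fold holds out exactly one industrial batch.
    With a single batch, no cross-validation is possible.
    """
    if len(industrial_batches) < 2:
        return []
    return [
        ([b for b in industrial_batches if b != held_out], held_out)
        for held_out in industrial_batches
    ]


# Hypothetical batch labels for four industrial training fermentations.
folds = industrial_folds(["I1", "I2", "I3", "I4"])
```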
The Adam optimisation algorithm was used, with a gradient norm clipping value of 1 to reduce the likelihood of exploding gradients. The order of the training sets was shuffled after every epoch. The regression losses (mean squared error, Equation (3)) were multiplied by 0.1 to ensure their magnitudes were similar to the classification losses (binary cross-entropy, Equation (4)). This aided the network in learning both the classification and regression tasks. After cross-validation, the hyperparameters which resulted in the lowest average validation error were used to train a final model using the entire training set. The networks were trained using TensorFlow 2.3.0. The coefficient of determination (R²), mean squared error (MSE), and mean absolute error (MAE) were used as performance metrics to evaluate the regression tasks during cross-validation. The accuracy, precision, and recall were used to evaluate the classification tasks during cross-validation. Evaluating multiple metrics provides a comprehensive assessment of a model's ability to fit to the validation and test sets and facilitates improved comparison between models. In the results section, only the MAE and accuracy are discussed to aid clarity.
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \qquad (3)$$

$$\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right)\right] \qquad (4)$$

where BCE is the binary cross-entropy loss, MSE is the mean squared error loss, N is the number of samples, y_i is the target variable and \hat{y}_i is the predicted value for sample i.

For the networks trained on both datasets simultaneously, the impact of dropout on the domain adaptation performance was evaluated. Dropout layers randomly remove neurons and their connections during training according to the designated probability [48]. Thus, "thinned" networks are trained during each training batch, encouraging more propagation paths through the network to be learned. Two dropout layers are used, one after the input layer and before the fully connected layer, and one after the fully connected layer and before the LSTM layer. The dropout layer probabilities were set to 0 or 0.5, producing four parameter combinations. Dropout was used to investigate whether it aided domain mixing in the network, rather than certain neurons only learning a single domain and the remaining neurons co-adapting.

There were more fermentation batches in the laboratory scale dataset than in the industrial scale dataset. As such, to ensure both domains were learned, the frequency of the industrial dataset in the training set was increased. For example, when a single industrial fermentation batch was present in the training set, this was passed to the network 13 times during one epoch. Similarly, when four industrial fermentation batches were present, each was used three times during training for each epoch (Table 2).

For the federated learning investigations, local models were trained on each dataset and a weighting factor was applied to the resulting local network weights before being summed to produce a global model. The global model weights were used as the initialisation weights for the next epoch of local network training. After training, the global model was evaluated on the test set. The weighting factors were changed depending on the number of industrial fermentation runs present in the training set, i.e., 0.9 for the industrial scale data local model and 0.1 for the laboratory scale model when a single industrial fermentation run was present in the training data, and 0.75 and 0.25 when four industrial fermentation runs were used in the training data (Table 2).
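The aggregation step of the federated strategy described above amounts to a weighted sum of the local model weights. The sketch below shows this for flattened weight vectors; the function name and the toy weight values are hypothetical, while the 0.9 and 0.1 factors are those quoted for a single industrial fermentation run.

```python
def federated_average(local_weights, factors):
    """Weighted sum of local model weights to form the global model.

    local_weights: one flat list of floats per local model.
    factors: one weighting factor per local model, summing to 1.
    """
    if abs(sum(factors) - 1.0) > 1e-9:
        raise ValueError("weighting factors must sum to 1")
    n_weights = len(local_weights[0])
    return [
        sum(factor * weights[j] for factor, weights in zip(factors, local_weights))
        for j in range(n_weights)
    ]


# Toy flattened weights after one epoch of local training (hypothetical).
industrial_local = [0.2, -1.0, 0.5]
laboratory_local = [1.2, 0.0, -0.5]

# Factors for the case of a single industrial run in the training set.
global_weights = federated_average(
    [industrial_local, laboratory_local], factors=[0.9, 0.1]
)
```

In each round, `global_weights` would be copied back into both local models before the next epoch of local training, so only weights, never sensor data, leave each site.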
Finally, fine-tuning the best performing models on the target domain data was assessed. As the models are used to monitor the industrial scale fermentations, the final models do not need to be accurate on the source domain laboratory scale fermentations. Therefore, after initial training to transfer knowledge from the source domain, fine-tuning on the target domain can increase model accuracy on the industrial scale data. All network weights were tuned. Preliminary investigations froze the model weights for the fully connected and LSTM layers and only tuned the output layers. However, this resulted in lower accuracy on the validation sets than when all weights could be updated.
These domain adaptation methodologies are compared with a model trained only on the industrial scale fermentation data, i.e., without using the laboratory scale data or domain adaptation. This is named the No DA model and is used as a base-case comparison.
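The loss weighting described earlier in this section, where the regression losses are scaled by 0.1 to bring them to a similar magnitude as the classification losses, can be sketched as follows. Summing the per-task losses into one total is an assumption, as are the function names and the probability clipping constant.

```python
import math


def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(targets)


def mean_squared_error(targets, predictions):
    """Mean of the squared differences between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)


def multi_task_loss(class_pairs, regression_pairs, regression_weight=0.1):
    """Sum of classification losses plus down-weighted regression losses.

    Each pair is (targets, predictions) for one task.
    Assumption: task losses are combined by simple summation.
    """
    classification = sum(binary_cross_entropy(t, p) for t, p in class_pairs)
    regression = sum(mean_squared_error(t, p) for t, p in regression_pairs)
    return classification + regression_weight * regression


# One toy classification task and one toy regression task.
loss = multi_task_loss(
    class_pairs=[([1, 0], [0.5, 0.5])],
    regression_pairs=[([2.0, 4.0], [1.0, 6.0])],
)
```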

Ultrasonic Measurements
Figure 3a-f displays the US feature and temperature results for the industrial scale fermentations. Full discussion of the US feature and temperature results for the laboratory dataset is included in [18]. A comparison between the two datasets is provided in the text. For the industrial scale dataset, the process temperature was decreased after the desired wort density had been reached, determined through off-line sampling and hydrometer measurements. As such, Figure 3b-f displays the results until one day after the temperature was decreased so that the US feature changes during ethanol production are clearly presented. The results show that the time of flight for the third reflection decreased, corresponding to an increase in the speed of sound, during ethanol production for all fermentations (Figure 3f). This agrees with [14,15] but contradicts the results found in [16,17,49], which monitored a decreasing speed of sound throughout fermentation. This is likely because [14,15] monitored an industrial fermentation process, similar to the industrial scale dataset in this work, whereas [16,17,49] monitored a small laboratory scale process (250 cm³). Therefore, the specific combination of water, ethanol, sugar, yeast, and CO2 concentrations present in industrial processes may produce an increasing speed of sound during ethanol production. Overall, the energy of the first reflection increases during ethanol production (Figure 3c), as found in [18]. This indicates an increase in acoustic impedance mismatch at the probe-wort interface. As the acoustic impedance is the product of the material density and speed of sound, this shows that the decreasing wort density has a larger impact than the increasing speed of sound on the wort acoustic impedance [42].
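The impedance argument above can be illustrated numerically. The sketch below uses the standard normal-incidence reflection coefficient between two media; all density and speed of sound values are assumed, order-of-magnitude figures for a PMMA-like probe and a fermenting wort, not measurements from this work.

```python
def acoustic_impedance(density, speed_of_sound):
    """Acoustic impedance as the product of density and speed of sound."""
    return density * speed_of_sound


def reflection_coefficient(z_probe, z_wort):
    """Magnitude of the amplitude reflection coefficient at normal incidence."""
    return abs(z_wort - z_probe) / (z_wort + z_probe)


# Assumed illustrative values (kg/m^3 and m/s), not data from this work.
z_probe = acoustic_impedance(1180.0, 2700.0)
z_wort_start = acoustic_impedance(1040.0, 1500.0)  # denser, slower
z_wort_end = acoustic_impedance(1010.0, 1520.0)    # less dense, faster

r_start = reflection_coefficient(z_probe, z_wort_start)
r_end = reflection_coefficient(z_probe, z_wort_end)
```

With these assumed values the wort impedance falls even though the speed of sound rises, so the mismatch with the probe grows and `r_end` exceeds `r_start`, which matches the direction of the first reflection energy trend described in the text.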
The energy of the third reflection shows no general trend during ethanol production (Figure 3d), indicating that the reduced proportion of the sound wave transmitted through the first probe-wort interface is offset by the increased sound wave reflection at the far wort-probe interface. The third reflection energy displays greater variation than the first reflection energy due to sound wave attenuation in the presence of CO2 bubbles, similar to the results found in [17,18]. In contrast, the laboratory scale data shows no trend in the speed of sound during fermentation and the third reflection energy follows a similar profile to the first reflection [18]. This is likely because these effects were masked by the varying temperature during ethanol production for the laboratory scale dataset, whereas the temperature was controlled during this period for the industrial fermentations. Figure 4 displays the first reflection energy for the first five fermentations from the laboratory dataset. The differing feature magnitudes and trajectories compared with Figure 3c showcase the need for domain adaptation techniques.

Figure 5a,c,e and Figure 5b,d,f display the classification accuracies for the beginning of ethanol production and end of ethanol production for the trained networks, respectively. Although the multi-task networks were also trained to predict the time remaining until (and had passed since) the start and end of ethanol production, the regression predictions are most useful close to the classification boundaries. For example, an accurate prediction of the time since ethanol production started is not needed near the end of the fermentation process, and an approximate time for when ethanol production will end would not be useful when the fermentation is lagging and never begins. Therefore, the classification results are most valuable when evaluating the utility of the trained model.
Furthermore, due to the multi-task nature of the model, the accuracy of the classification results correlates with the ability to learn the regression tasks close to the classification boundaries. As such, only the classification results are included in the presented graphs. However, the regression accuracies are presented in Table 3 and discussed in the text.

Figure 5a,b display the results for the networks which were trained on the source and target domain data simultaneously. Preliminary investigations determined that the 0.5, 0.5 dropout rate models failed to train accurately for all training set sizes. Models with 0.5, 0 dropout rates produced inconsistent results, with some models predicting accurately on the test set data and some models performing worse than the model trained on only the industrial scale fermentations (No DA). However, the 0, 0 and 0, 0.5 models achieved higher accuracy than the No DA model for six out of eight classification tasks. Furthermore, the 0, 0 model achieved lower MAE for seven out of eight regression tasks compared to the No DA model. Therefore, the 0, 0 and 0, 0.5 dropout rates were used for subsequent investigations and the results of these models are presented in Figure 5a-f and Table 3. These higher accuracy results for the domain adaptation models indicate that using the laboratory scale data to train the networks benefits the predictions on the industrial scale dataset.

Figure 5c,d display results for the models trained in a federated learning strategy. The two federated models are trained using the best performing dropout probabilities determined from the previous investigation and are compared with the No DA baseline results. Compared with the No DA model, the 0, 0 model achieved higher classification accuracy for six out of eight classification tasks and lower MAE for six out of eight regression tasks.
When using four industrial scale fermentation batches in the training set, the 0, 0 model reached accuracies of 99.8% and 99.9% for predicting the start and end of ethanol production, respectively. Furthermore, the 0, 0.5 models achieved better results for seven out of eight of the classification and regression tasks. Overall, the federated learning models were more accurate than their corresponding non-federated training models using the same dropout probabilities, achieving higher classification accuracies on eight tasks compared to seven for the non-federated learning models. Similarly, the federated learning models achieved lower MAEs on 10 regression tasks compared with five for the non-federated learning models. This is an encouraging result as it indicates that not only can federated training provide benefits over models that train without the laboratory scale data, but that they can also perform better than conventionally trained domain adaptation networks in addition to maintaining data privacy. The reason for this may be the increased model learning afforded in the industrial scale dataset local model. During training, this model learns from an epoch full of the industrial scale training dataset compared with the non-federated model which only learns from the industrial scale target domain intermittently between source domain fermentation runs. This increased learning without switching between domains may allow the network weights to travel further towards local optima for the industrial scale dataset in each epoch. This contrasts with results presented in the wider literature, where federated learning degraded model accuracy compared with non-federated learning by 3.3% [50], 1.66% [51], and <10% [52].

Figure 5e,f display the classification results for the previously discussed federated models fine-tuned on the industrial dataset.
While still providing improvements over the No DA base case, achieving higher classification accuracies for 12 out of 16 tasks, the fine-tuned models were less accurate than the federated learning models they started from. This is most likely due to overfitting during fine-tuning, caused by the large network size required to learn both domains in the starting models. For example, the No DA models had a maximum optimum number of eight neurons in the fully connected layer and four LSTM units to learn only the target domain. However, the federated learning models required a maximum of 128 neurons in the fully connected layer and eight LSTM units to fit both domains. Therefore, when fine-tuning on the industrial dataset after fitting to both domains, the model begins to overfit, especially when four industrial batches are used in the training set.
Table 3. The regression accuracies of each of the models for predicting the time remaining until the start and end of ethanol production, where MAE is the Mean Absolute Error of the prediction. The baseline model was trained using only data from the industrial fermentations. The numbers in the Model column indicate the dropout probability used in each dropout layer, e.g., 0, 0 represents zero dropout probability in both layers.
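The federated training schedule discussed for Figure 5c,d, in which each local model completes a full epoch on its own domain before the models are combined, can be illustrated with a minimal sketch. It assumes a FedAvg-style weighted average of the local weights, which this section does not specify, and replaces the LSTM networks with a toy linear model. The function names, the learning rate, and the `lab` and `industrial` data are hypothetical and are not taken from this work.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (input, target)
Weights = List[float]         # [slope, intercept] of a toy linear model


def local_epoch(weights: Weights, data: Sequence[Sample], lr: float = 0.05) -> Weights:
    """One uninterrupted pass of per-sample gradient descent over one domain."""
    w, b = weights
    for x, y in data:
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return [w, b]


def federated_average(client_weights: Sequence[Weights],
                      client_sizes: Sequence[int]) -> Weights:
    """Average local weights, weighted by local dataset size (assumed rule)."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(cw[i] * n for cw, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def federated_round(global_weights: Weights,
                    client_data: Sequence[Sequence[Sample]]) -> Weights:
    """Each client trains on its own data only; raw data never leaves the client."""
    updates = [local_epoch(list(global_weights), data) for data in client_data]
    sizes = [len(data) for data in client_data]
    return federated_average(updates, sizes)


# Hypothetical data: two domains sharing a trend but with different offsets.
lab = [(x / 10, 2.0 * (x / 10) + 1.0) for x in range(10)]
industrial = [(x / 10, 2.0 * (x / 10) + 1.2) for x in range(10)]

weights: Weights = [0.0, 0.0]
for _ in range(200):
    weights = federated_round(weights, [lab, industrial])
```

Only the weights are exchanged between the local models and the aggregation step, which is what preserves data privacy in this kind of strategy. In the sketch, the shared model settles between the two domain-specific solutions.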

Future Research Directions
Overall, transferring knowledge from the source domain increased model accuracy when applied to the target domain data. Using more than two datasets could increase this benefit further, especially if the datasets are more similar, e.g., from multiple industrial fermentation processes. The two datasets used in this work had distinct differences. For example, the laboratory scale fermentations had no temperature control, and the industrial scale dataset showed an increasing time of flight during fermentation. It is anticipated that more similar datasets would provide even greater benefits. Furthermore, beyond increasing model accuracy, the domain adaptation methodology can also reduce the time for ML model development. After training across two domains, the final models could be used to predict using data from a new fermentation process without having been trained on this new domain. However, incorporating a small number of batches from this new fermentation process would be expected to aid model accuracy.
In this work, the waveform energy was the single feature used to describe the oscillating part of the US waveform. The reason for this was that previous work demonstrated that multiple oscillating waveform features are unlikely to follow similar trends across domains and their inclusion would degrade model accuracy [21]. However, for many applications of ML and US sensors, multiple features may need to be used to accurately monitor changes in this portion of the US waveform. In this case, the methodologies presented in this work may be used to obtain predictions on the target domain data from models trained on both the source and target domains. These predictions can then be used as an additional feature in a model only trained on the target domain data. In this way, other features describing the oscillating part of the waveform can be used as no domain adaptation is required while also incorporating knowledge from the source domain.
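The feature-stacking idea described above can be sketched as follows, as an illustration under stated assumptions rather than the method used in this work. The helper `augment_with_cross_domain_prediction`, the stand-in `toy_cross_domain_model`, and the example feature values are all hypothetical. The cross-domain model is assumed to take only the features shared by both domains, while the target-only model receives every target-domain feature plus the appended prediction.

```python
from typing import Callable, List, Sequence


def augment_with_cross_domain_prediction(
    target_rows: Sequence[Sequence[float]],
    shared_feature_idx: Sequence[int],
    cross_domain_predict: Callable[[Sequence[float]], float],
) -> List[List[float]]:
    """Append the cross-domain model's prediction to each target-domain row."""
    augmented = []
    for row in target_rows:
        # Only features that behave similarly in both domains are passed
        # to the model trained on source and target data.
        shared = [row[i] for i in shared_feature_idx]
        augmented.append(list(row) + [cross_domain_predict(shared)])
    return augmented


def toy_cross_domain_model(shared: Sequence[float]) -> float:
    """Hypothetical stand-in for a model trained on both domains."""
    return 0.5 * shared[0] + 0.1


# Hypothetical rows: column 0 is a shared feature (e.g. waveform energy),
# columns 1 and 2 are target-only oscillating-waveform features.
rows = [[1.0, 0.2, 0.7], [2.0, 0.4, 0.9]]
augmented_rows = augment_with_cross_domain_prediction(
    rows, shared_feature_idx=[0], cross_domain_predict=toy_cross_domain_model
)
```

The augmented rows would then be used to train a model on the target domain alone, so the extra waveform features never need to transfer between domains.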
Further research should use the combination of ML and US measurements in preference to calibration procedures. In this work, the speed of sound increased during fermentation, agreeing with [14,15], which were conducted at large scale, but contradicting [16,17,49], which were conducted at small scale. This indicates that there is a discrepancy in the speed of sound trend at the ethanol, sugar, yeast, and CO2 concentrations and temperatures used at small and large scales. Therefore, extensive and complicated calibration procedures would need to be used to account for these effects. In addition, ML offers several distinct advantages: it negates the need for these complex calibration procedures accounting for all the parameters previously listed; more information from the waveforms is typically used through feature extraction; more complex fitting procedures are used, allowing for increased prediction accuracy; and validation procedures encourage model accuracy even on process parameters outside the range the model was trained on.
Acceptable ML model accuracy is dependent on its desired application. In this work, the highest accuracy model (federated learning, zero dropout, four industrial training batches) achieved accuracies of 99.8% and 99.9% for predicting the start and end of ethanol production, respectively. This is equivalent to the current method of determination, off-line wort density measurements using hydrometers, which are only conducted once every several hours (or even less frequently overnight) and have reduced accuracy when foam is present. However, these model accuracies were obtained using only a single test set batch, and therefore a larger dataset would be needed to determine whether these accuracies are consistent.
US measurements and ML could also be used in combination with sampling methods to reduce the amount of sampling required (and therefore operator burden), provide timely results between samples (for example, overnight), and predict when fermentation stages will be reached to improve plant scheduling. In this case, ML models can be continuously updated using the labelled data from the sample measurements. If US sensors are intended to replace sampling entirely, higher accuracy models and longer model development times would be needed. In addition, a model that reported a confidence level for each prediction would increase trust in the model by identifying when sample measurements should be used as a safeguard.
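One simple way such a safeguard could work is sketched below. It assumes the classifier outputs a probability that a fermentation stage has been reached and treats the distance of that probability from 0.5 as a confidence measure. The function name and the 0.9 threshold are hypothetical, and the raw output of a neural network is not necessarily a well-calibrated probability.

```python
def needs_manual_sample(prob_stage_reached: float,
                        min_confidence: float = 0.9) -> bool:
    """Return True when the prediction is too uncertain to act on alone.

    prob_stage_reached: model output in [0, 1] for a binary stage label.
    min_confidence: hypothetical threshold below which an off-line
        wort density sample is requested as a safeguard.
    """
    if not 0.0 <= prob_stage_reached <= 1.0:
        raise ValueError("prob_stage_reached must lie in [0, 1]")
    # Confidence in whichever class the model favours.
    confidence = max(prob_stage_reached, 1.0 - prob_stage_reached)
    return confidence < min_confidence


# Hypothetical overnight predictions from the US sensor model.
predictions = [0.02, 0.55, 0.97]
flags = [needs_manual_sample(p) for p in predictions]
```

Here only the middle prediction would trigger a sample, since the other two are confidently on one side of the classification boundary.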

Conclusions
This work has used previously collected US sensor data from laboratory scale fermentations to improve ML model accuracy on an industrial scale process. Overall, all methodologies led to improvements in model accuracy over training on the target domain alone. The federated learning methodology performed best, achieving higher accuracy for 14 out of 16 machine learning tasks compared with the base case model, and reaching test set accuracies of 99.8% and 99.9% for predicting the start and end of ethanol production when trained on four industrial batches with no dropout. Federated learning improved model accuracy over the traditional simultaneous domain training, possibly by allowing increased tuning of the network weights to converge on local target domain optima. However, fine-tuning led to a decrease in model accuracy, most likely due to overfitting caused by the larger number of neurons and LSTM units needed to accurately train on both domains. The methodologies investigated not only provide increased accuracy, but also speed up model development by reducing the number of fermentation runs that must be monitored in the target domain.