Deep Transformers for Computing and Predicting ALCOA+ Data Integrity Compliance in the Pharmaceutical Industry

Abstract: Strict adherence to data integrity and quality standards is crucial for the pharmaceutical industry to minimize undesired effects and ensure that medicines are of the required quality and safe for patients. A common data quality standard in the pharmaceutical industry is ALCOA+, which is a set of guiding principles for ensuring data integrity. Failure to comply with ALCOA+ guidelines, usually detected after audit inspections, may result in serious consequences for pharmaceutical manufacturers, such as the incurrence of fines, increase in costs, and production delays. It is, therefore, imperative to devise methods able to monitor ALCOA+ compliance and detect decreasing trends in data quality automatically. In this paper we present ALCOAi, a deep learning model based on the transformer architecture, which is able to process large quantities of non-homogeneous data and compute current and future ALCOA+ compliance. The proposed model can estimate trends concerning most ALCOA+ principles. The model was tested on a real dataset comprising raw sensor data, machine-provided values, and human-entered free-text data from two pharmaceutical manufacturing lines. The performed tests led to promising results in forecasting ALCOA+ compliance.


Introduction
Over the last few decades, the main ambition of the pharmaceutical industry has been to provide optimum quality products that adhere to ever-updated good manufacturing practice (GMP) regulations, which need to be applied throughout all stages of drug manufacturing [1]. Data integrity, which refers to the accuracy and consistency of data over its lifecycle, is an essential part of GMP regulations and an issue of paramount importance in any pharmaceutical industry quality system, which must ensure that medicines are of the required quality [2]. In particular, data integrity is critical in pharmaceutical manufacturing because it directly impacts patient safety and product quality. Pharmaceutical manufacturing processes are highly regulated, with strict guidelines set by regulatory agencies such as the FDA (Food and Drug Administration) [3] and the EMA (European Medicines Agency) [4]. These guidelines require pharmaceutical production companies to ensure the accuracy, completeness, and consistency of their data, including production and quality control data.
Ensuring data integrity is crucial for maintaining the quality of pharmaceutical products, as any errors or inconsistencies in data could result in drug product defects, contamination, or even harm to patients. For example, if production data are not properly recorded, this could lead to incorrect dosages or missing ingredients in a specific drug, which could in turn have serious consequences for patients taking this drug. In addition to patient safety, data integrity is also important for regulatory compliance. Regulatory agencies expect pharmaceutical companies to maintain accurate and complete product records to demonstrate that their products are safe, effective, and manufactured in compliance with applicable regulations. Ensuring data integrity requires a systematic combination of technical aspects, such as secure data storage and access controls, and organizational aspects, such as training policies and procedures. Furthermore, pharmaceutical production companies must conduct regular audits and reviews of their data to identify and resolve any issues that may arise due to data integrity violations.
To this end, the ALCOA+ principles have been defined by the U.S. Food and Drug Administration to provide accurate, complete, consistent, enduring, and available data. These data qualities are indispensable for data and metadata integrity in pharmaceutical manufacturing [2]. However, until recently [5], the ALCOA+ principles referred to non-quantifiable properties that data should possess, and the evaluation of whether the data adhered to the ALCOA+ principles was a purely qualitative task. Various data processing and management methods have so far been proposed to ensure the observance of the ALCOA+ data integrity rules, in line with current pharmaceutical industry regulatory frameworks [6]. Regulatory frameworks such as ALCOA+ generally include well-defined structured rules, reflecting applicable laws and regulations, imposed by international bodies. Failure to comply with the respective regulations can be disastrous for pharmaceutical manufacturing, as it may cause long delays and increase costs. In this respect, achieving data integrity through satisfaction of the ALCOA+ requirements is a fundamental step towards ensuring compliance with the respective pharmaceutical regulations, which further results in more efficient processing and increased product quality.
The nine ALCOA+ principles are the following:
1. Attributable: This means that all collected data should include information about the individuals who collected the data, the individuals who took action, and the time when the action was carried out.
2. Legible: This refers to the requirement for data to be accurate and comprehensible. However, legibility encompasses more than just reading the written content; it also involves understanding the surrounding context of the information.
3. Contemporaneous: This addresses the issue of recording data promptly and accurately for both individuals and systems. When dealing with electronic data, it is customary to timestamp the activities or actions to ensure a chronological order.
4. Original: This refers to keeping records in their original form rather than using duplicates or transcriptions, especially when it comes to manual record-keeping. The initial recording of the data, whether it is on paper or in a digital system, should serve as the primary record. In the case of digitally recorded data, it is crucial to have technical and procedural measures in place to prevent alterations to the original recording.
5. Accurate: This ensures that all records accurately depict the actual events without any mistakes. Moreover, it is crucial to refrain from altering the original information in a manner that causes its loss (e.g., compression, rounding of numerical values, acronyms, etc.).
6. Complete: It is necessary for all captured information to possess a comprehensive audit trail to demonstrate the absence of deletions or losses. This requirement encompasses not only the primary data recording but also metadata, retest data, analysis data, and other related elements. Additionally, audit trails must be in place to track any modifications made to the data.
7. Consistent: This refers to the need for data timestamping, which must be sequential and in ascending order. Timestamps should also be added to indicate modifications made to the original data recording.
8. Enduring: This refers to the long-term storage of data, which must remain readable and understandable for many years after their creation.
9. Available: This refers to the ability of data to be accessible to every interested party at any time during its life cycle.
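To make two of the principles above more concrete, the following minimal sketch checks a small batch log for attributability (who, what, and when are recorded) and consistency (timestamps in ascending order). The record layout, field names, and sample values are hypothetical, and the checks are illustrative only; they are not the scoring rules of [5].

```python
from datetime import datetime

# Hypothetical record layout: the field names are illustrative assumptions.
REQUIRED_FIELDS = ("operator", "action", "timestamp")


def is_attributable(record):
    """True when who, what, and when are all present and non-empty."""
    return all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def is_consistent(records):
    """True when timestamps never decrease from one record to the next."""
    stamps = [datetime.fromisoformat(r["timestamp"]) for r in records]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))


# Made-up log: the second entry lacks an operator and the third is out of order.
batch_log = [
    {"operator": "op-17", "action": "start", "timestamp": "2023-03-01T08:00:00"},
    {"operator": "", "action": "sample", "timestamp": "2023-03-01T08:05:00"},
    {"operator": "op-17", "action": "stop", "timestamp": "2023-03-01T08:03:00"},
]

# Share of records that can be attributed to an operator.
attributable_share = sum(is_attributable(r) for r in batch_log) / len(batch_log)
log_is_consistent = is_consistent(batch_log)
```

A per-batch share of this kind is one possible way of turning a qualitative principle into a number that can be tracked over time.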
Assessment of whether pharmaceutical process data comply with ALCOA+ principles is currently a tedious and sometimes cumbersome task, involving largely manual investigation of lengthy reports and tracing of audit trails with general-purpose software. Although some guidelines for automating ALCOA assessment have been introduced [5], they still focus on identifying compliance violations only after they have taken place, thus not preventing production delays and increases in production cost due to faulty batches being produced.
Furthermore, raw data in the pharmaceutical industry are often heterogeneous and come in vast amounts and various types, as generated from pharmaceutical plant data sources, thus making data integrity and data management challenging tasks. To make things worse, various state-of-the-art technologies in contemporary pharmaceutical industries, such as cyber-physical systems and process analytical technologies (PAT), have posed stringent demands for mass and smart manufacturing [8]. Given the sheer amounts of available data and the extreme variety in data types and content, artificial intelligence techniques and, in particular, deep learning models, provide efficient and adaptable solutions. In particular, when compared against traditional machine learning or statistical methods, deep learning techniques can substantially alleviate the burden of manually processing, transforming, and annotating large non-homogeneous datasets, and can also support data integrity and ALCOA+ compliance [2]. Going one step further, such processed heterogeneous information, along with selected historical raw data and alarm signals, could be an adequate input for deep-learning-based prediction models and advanced data analytics. Such methods can derive safe prognosis results regarding ALCOA+ compliance and maintenance, as well as production line health conditions and production line maintenance. It is therefore imperative for any pharmaceutical manufacturing system to be able to support ALCOA+ compliance prediction while also maintaining ALCOA+ standards and preserving a steady qualitative performance.
In this paper, we present ALCOAi, a transformer-based deep learning network for calculating and forecasting each of the ALCOA+ principles mentioned above. ALCOAi is used for sophisticated data processing and management of continuously increasing raw data coming from the pharmaceutical manufacturing lines, in conjunction with the strict criteria and rules imposed by pharmaceutical regulatory frameworks, with the overall aim of maintaining ALCOA+ compliance. This model was tested on real pharmaceutical manufacturing line datasets and also compared and evaluated, in terms of prediction efficiency, against other prominent state-of-the-art time-series forecasting methods.
The main contributions of this work are the following:
• it describes an approach to assess compliance with ALCOA+ principles;
• it describes a model to predict compliance with ALCOA+ principles;
• it evaluates the AI models on a real dataset consisting of data stemming from two different pharmaceutical production lines.
The rest of this paper is structured as follows: Section 2 is dedicated to exploring related work, and Section 3 to the transformer-based network solution employed for resolving the problem of ALCOA+ prediction. Performance evaluation of the proposed model is included in Section 4, along with a discussion and analysis of the obtained results. Finally, conclusions are drawn and future directions are given.

Related Work
Time-series forecasting refers to the process of predicting future values of a time-dependent variable based on its historical behavior. In particular, time-series forecasting for data integrity is a niche research area that deals with predicting the behavior of data over time to maintain its accuracy and consistency. While there is not an extensive amount of literature on this specific topic, there are related works that address the broader themes of signal processing, time-series forecasting, and data integrity. Along this line, transformer-based AI models have shown significant improvements in various NLP tasks [9] and have also been successfully applied to time-series forecasting [10]. Various state-of-the-art machine learning techniques have been introduced and compared with respect to their performance and efficiency in data signal processing and data fusion [11], and their advantages and disadvantages have been discussed in detail [12]. Furthermore, advanced deep learning methods for efficiently managing big heterogeneous data have been proposed in [13]. In addition, data fusion and processing methods focused on IoT and sensor-based industrial environments and applications have been suggested and compared in [14][15][16]. Data processing methods have also been applied in the broader industry sector [17]. More specifically, representative advanced deep learning methods have been proposed in [18,19] for efficiently managing complex and big datasets in industrial applications.
As far as the pharmaceutical industry domain is concerned, various state-of-the-art AI and machine learning methodologies have been applied in drug discovery and development. Representative approaches are described in [20][21][22][23][24]. Yet, the potential and perspectives of adopting machine and deep-learning-based methods for assuring ALCOA+ compliance and maintenance prediction in the pharmaceutical manufacturing sector have not been investigated so far. A prominent exception is an ALCOA+ compliance software tool [5] implementing an analytic model for quantifying the ALCOA+ principles. However, that tool lacks any ability to predict how ALCOA+ values will evolve in time.

Description of the ALCOAi Model
The ALCOA+ principles are of paramount importance in today's pharmaceutical industry as these principles emphasize describing and validating the drug creation process in an end-to-end manner. Since it is now possible to quantify the ALCOA+ principles, pharmaceutical manufacturing industries can define thresholds under which individual principles can be considered violated. Violations of the ALCOA+ principles are a common occurrence and pharmaceutical companies must dedicate sufficient resources to ensure ALCOA+ compliance.
While much of the data needed to assess ALCOA+ compliance are compiled automatically by the automated systems of the production lines, many types of data, and especially human-derived data, are often missing or partially compiled, with a consequent decrease in data quality and integrity. Moreover, given the workload of pharmaceutical industries, operators often overlook compiling seemingly unimportant information, such as operator identification, or adding notes about crucial production events, with a consequent detrimental effect on ALCOA+ adherence.
For all these reasons, the research and development work performed in the context of the SPuMoNI European project (www.spumoni.eu) aims at maximizing the possibility of maintaining a high ALCOA+ compliance standard and, as a project result, a set of predictive AI-based models has been developed. These models monitor pharmaceutical production data continuously and make predictions about trends in ALCOA+ principle values based on current and past data compilation behaviors. The output of these models can be used by pharmaceutical production managers as a decision support tool for adopting the measures necessary for best data quality and integrity practices. The ensemble of these models is called ALCOAi and it is composed of two main components:
• The ALCOAi regression model, which is used to calculate ALCOA+ values based on the complete available data regarding each production batch.
• The ALCOAi forecasting network, which employs a transformer network capable of forecasting ALCOA+ principle values based on past batch production data and the corresponding ALCOA+ values.
For each lot of pharmaceuticals manufactured, that is, for each production batch, the data and metadata generated during the manufacturing process are associated with a set of ALCOA+ values.
More specifically, the general architecture of the ALCOAi model is shown in Figure 1. The model is an ensemble of deep learning networks tailored to processing different kinds of data. Operator comments (in the form of free text) are processed by a language modeling transformer [25], and raw sensor data streams are processed by recurrent units able to identify patterns in data sequences (GRUs [26]). Categorical and single numerical values are processed by linear neural networks. The outputs of the individual models are then concatenated and fed to a linear regressor network where ALCOA+ values are computed.
The modularity of the approach permits its adaptation to different datasets. In fact, if different data types are introduced in the datasets (e.g., images), a CNN feature extractor could be employed in the processing stage of the ALCOAi pipeline and its output concatenated to the input of the regression layer.
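The data flow of this ensemble can be sketched in plain Python as follows. This is only a structural illustration under strong assumptions: the three encoder functions are simple stand-ins for the language transformer, the GRUs, and the linear networks, and all names, dimensions, and weights are hypothetical rather than taken from the ALCOAi implementation.

```python
def encode_text(comment, dim=4):
    """Stand-in for the language-model branch: a tiny hashed bag of words."""
    vec = [0.0] * dim
    for word in comment.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    return vec


def encode_sensor(stream):
    """Stand-in for the GRU branch: summary statistics of the sequence."""
    mean = sum(stream) / len(stream)
    return [mean, min(stream), max(stream)]


def encode_scalar(values):
    """Stand-in for the linear branch: values are passed through as floats."""
    return [float(v) for v in values]


def linear_regressor(features, weights, bias):
    """One output per ALCOA+ principle: dot(weights[k], features) + bias[k]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


# Each branch yields a fixed-size vector; the vectors are concatenated.
features = (encode_text("filter changed by operator")
            + encode_sensor([20.5, 21.0, 21.5])
            + encode_scalar([1, 0]))

n_outputs = 5  # assumed: one score per ALCOA+ principle being regressed
weights = [[0.01 * (k + 1)] * len(features) for k in range(n_outputs)]
bias = [0.0] * n_outputs
scores = linear_regressor(features, weights, bias)
```

Supporting a new data type, such as the image example mentioned above, would amount to adding one more encoder whose output is appended to `features` before the regressor.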
The network is trained to minimize the mean squared error loss function:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2 \qquad (1)$$

where N is the number of ALCOA+ principles, p_i is the estimated value of the i-th ALCOA+ principle, and y_i is the actual one.

Transformers are a type of neural network architecture that has traditionally been used for natural language processing tasks. However, they have recently gained popularity in time-series forecasting [10,27], where they are able to effectively capture complex patterns in the time-series data as well as relationships with other types of variables.
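Returning to the training objective of the regression network, the mean squared error between estimated and actual ALCOA+ values can be sketched in plain Python as follows. The averaging over the number of principles follows the standard definition of MSE and is an assumption here, as are the sample values.

```python
def mse_loss(predicted, actual):
    """Mean of squared differences between estimated and actual values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    n = len(predicted)
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / n


# Made-up scores for three principles of a single production batch.
predicted = [0.9, 0.8, 1.0]
actual = [1.0, 0.6, 1.0]
loss = mse_loss(predicted, actual)
```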
The general architecture of a transformer consists of an encoder and a decoder, as depicted in Figure 2. The encoder takes as input a sequence of data and generates a representation of that sequence. The decoder then uses this representation to generate an output sequence. In time-series forecasting with external data, the input sequence is typically a window of historical time-series data along with the corresponding values of external variables.
The task of detecting when an ALCOA+ principle is going to be violated is posed, again, as a regression problem: given a set of input data, the model creates a set of ALCOA+ principle predictions for the next production batch. At each subsequent iteration, the output of the model at the previous iteration is concatenated to the input to produce predictions further in time (Figure 3).
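The iterative procedure described above, in which each prediction is fed back as input for the next step, can be sketched as follows. The one-step model used here is a trivial linear extrapolation that merely stands in for the trained transformer; the function names and sample values are hypothetical. For readability the sketch keeps the history oldest-first and appends each prediction at the end.

```python
def forecast(model, history, steps):
    """Roll a one-step-ahead model forward for `steps` future batches."""
    window = list(history)  # copy so the caller's history is left untouched
    predictions = []
    for _ in range(steps):
        next_values = model(window)
        predictions.append(next_values)
        window.append(next_values)  # the prediction becomes the newest input
    return predictions


def naive_trend_model(window):
    """Placeholder for the trained network: extrapolate the last difference."""
    last, previous = window[-1], window[-2]
    return [2 * a - b for a, b in zip(last, previous)]


# Made-up per-batch values for two principles, oldest batch first.
history = [[0.90, 0.80], [0.88, 0.80], [0.86, 0.80]]
future = forecast(naive_trend_model, history, steps=3)
```

Because every step consumes earlier predictions, any error made at one step propagates to all later ones, which is why accuracy is expected to degrade as the horizon grows.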
Similar to the regression task, we employ the MSE loss, but this time the individual losses are also summed along the time axis, as in the following formula:

$$\mathrm{MSE}_T = \sum_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N} (p_{it} - y_{it})^2 \qquad (2)$$

where T is the number of future predictions that we need, N is the number of ALCOA+ principles, p_{it} is the predicted value of the i-th ALCOA+ principle at time t (in batches), and y_{it} is the actual value of the i-th ALCOA+ principle at time t. By setting manual thresholds (according to individual internal business constraints) for each one of the ALCOA+ principles, the ALCOAi model is capable of detecting whether and when the next principle violation will occur, in terms of batch productions, if all conditions are kept the same as they are at the time of analysis. In this way, the model offers the functionality of ALCOA+ principle violation probability on a per-future-batch basis (e.g., the probability to violate an ALCOA+ principle at the next batch, the one after that, etc.). Depending on the results of the analysis, an ALCOA+ trend curve can then be constructed (Figure 4).
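As an illustration of the two ideas in this paragraph, the sketch below computes a loss that sums the per-batch mean squared error over the forecast horizon and then locates the first future batch at which a predicted value drops below its threshold. The per-batch averaging, the principle names used as dictionary keys, the threshold values, and the predictions are all assumptions made for the example.

```python
def forecast_loss(predicted, actual):
    """Sum over future batches of the per-batch mean squared error."""
    total = 0.0
    for p_t, y_t in zip(predicted, actual):
        total += sum((p_t[k] - y_t[k]) ** 2 for k in p_t) / len(p_t)
    return total


def first_violation(predicted, thresholds):
    """Map each principle to the first future batch (1-based) below threshold.

    The value is None when no violation is predicted within the horizon.
    """
    result = {}
    for name, limit in thresholds.items():
        result[name] = next(
            (t for t, values in enumerate(predicted, start=1)
             if values[name] < limit),
            None,
        )
    return result


# Made-up forecasts for the next three batches.
predicted = [
    {"attributable": 0.92, "complete": 0.85},
    {"attributable": 0.88, "complete": 0.84},
    {"attributable": 0.79, "complete": 0.83},
]
actual = [
    {"attributable": 0.92, "complete": 0.85},
    {"attributable": 0.88, "complete": 0.84},
    {"attributable": 0.89, "complete": 0.83},
]
thresholds = {"attributable": 0.80, "complete": 0.75}  # hypothetical limits

loss = forecast_loss(predicted, actual)
violations = first_violation(predicted, thresholds)
```

With these values, `violations` reports batch 3 for the attributable principle and no predicted violation for the complete principle.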

Performance Evaluation

Dataset Description and Hyperparameters
All data for this research were provided by a pharmaceutical manufacturing company which participates in the SPuMoNI Project. Part of these data were obtained from two production lines that were properly configured to register values from sensors throughout the production process. These values were mainly physico-chemical parameters such as temperature, humidity, speed, pressure values, etc.
The first production line dataset (I1000) contains 176 independent drug batches and the second production line dataset (I600) contains 296 independent drug batches. Each batch is treated as a new cycle of the production process. When a batch ends, a new cycle of drug production starts, so consecutive batch productions are ordered chronologically. The data from the two production lines comprise the values from sensors, related data and metadata, and JSON files describing the ALCOA+ values for each batch.
For the regression part of our experimentation, the dataset was created and organized as follows:
1. Gather each input data sample and associate it with the set of corresponding ALCOA+ scores, calculated as in [5].
2. Normalize and encode each input sample as described in the previous section.
3. Feed each data sample to the network and gather its output.
4. Compare the output of the network to the corresponding actual ALCOA+ values and calculate the mean squared error loss (Formula (1)).
The ALCOA+ principle values were calculated according to the rules defined in [5]. The hyperparameters which have been used for the training of all models are presented in Table 1. All of the experiments described herein were written in the Python programming language using the TensorFlow framework (https://www.tensorflow.org/, accessed on 27 June 2023). The experiments were executed on a PC with an AMD Ryzen 7 5800X CPU, 64 GB of RAM, and an NVIDIA Titan X GPU with 12 GB of VRAM.

Network Configuration Selection
Network configuration selection consists in conducting repeated experiments with varying network hyperparameters with the aim of identifying the optimal set of settings for the network's architecture and training configuration. At the beginning of the experiment, the hyperparameter selection was performed with the aim of improving the performance of the models so that they obtain the best possible results. The three models were trained with varying hyperparameter values and tested on the validation set. Then, the best model architecture was identified by selecting the one that achieved the lowest validation loss. Table 2 shows the optimal hyperparameters for the three tested network architectures.

Results and Discussion
The purpose of the performed experimental evaluation is to assess the performance of different AI models in regressing and predicting the values of the ALCOA+ principles by identifying correlations with manufacturing line data. In Table 3, low-level statistical information (maximum, minimum, average, and standard deviation) is shown regarding the values contained within the analyzed dataset.
Based on these values, it is important to note at this point that for some of the ALCOA+ principles it was not feasible to extract meaningful patterns in the data. This was the case when the ALCOA+ values did not include any informative content because there was no value range (original, available, consistent, and enduring). So, in the experiments presented in this work we only included the ALCOA+ principles that we had meaningful data on, specifically, the attributable, legible, contemporaneous, accurate, and complete principles. Based on the available data we trained a regression model, as described in the previous section, and then we used this model to aid the training of the predictive model. The performance of this latter model was subsequently compared against the following state-of-the-art time-series processing network types:
• GRU [30]
• LSTM [31]
The results obtained by this performance evaluation are shown in Table 4. For training and evaluation, k-fold cross-validation (k = 10) was used, with one fold held out for evaluation in each run. The values reported are the average values obtained across all folds from all runs. The best results were achieved using the proposed model on both datasets across all ALCOA+ principles. In particular, while the GRU- and LSTM-based models achieved more or less similar performance, the ALCOAi model outperformed them by a safe margin in almost all of the cases. The smallest performance gap was observed in the regression result of the legible principle, while the largest one was observed for the contemporaneous principle.
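The evaluation protocol can be illustrated with a minimal, dependency-free sketch of a 10-fold split together with the mean absolute error metric. Whether the folds in the original experiments were contiguous or shuffled is not stated, so the contiguous split below is an assumption, and the function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual values."""
    return sum(abs(p - y) for p, y in zip(predicted, actual)) / len(actual)


# 176 batches, as in the I1000 dataset, split into k = 10 folds.
folds = list(k_fold_indices(176, 10))
example_mae = mean_absolute_error([1.0, 0.5], [0.5, 0.5])
```

In each run, a model would be trained on the `train` indices and scored with `mean_absolute_error` on the held-out `test` indices, and the per-fold scores would then be averaged.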

ALCOA+ Value Prediction
Regarding the prediction of the ALCOA+ values, we tested our model and compared its performance against an LSTM and a GRU model on the same datasets. For this evaluation, we ordered data batches in a chronologically descending manner, creating a batch data vector, and fed it to the models, batch by batch. The output of the model following the forward step of the last training batch was considered the output of the model at batch step 0. This result was then inserted at the start of the batch data vector (becoming the most recent data sample) and re-fed to the models. The output of this operation was the predicted ALCOA+ values at batch step 1. This output was then again inserted at the start of the previously obtained vector, and so on. This process was repeated until batch step 50, because beyond that point average errors started to rise, producing nonsensical results.
The comparison results clearly show that, on average, the ALCOAi model surpasses the performance of the GRU and LSTM models (Figure 5). In particular, as shown in Tables 5 and 6, the ALCOAi model consistently achieves lower MAE values, averaged across all ALCOA+ principles, when compared against the other two models on both production line datasets. However, all of the models demonstrated a similar exponential trend in the calculated MAE with increasing batch numbers, meaning that while we are able to model ALCOA+ value trends in the near future with some accuracy, predicting such values further in time becomes inconsistent and unreliable. The exact threshold that defines whether an MAE value is acceptable or not depends on individual business policies.
Nevertheless, the disparity in performance between the ALCOAi and the other models becomes more evident as batch numbers increase, meaning that the ALCOAi model is the most suitable method (within an acceptable margin for error) among the models tested in this work. Further, all models performed worse on the I1000 dataset, which can be attributed to the lower number of samples available for training in that set.

Table 5. Mean absolute error (MAE) between real values and predicted ones for ALCOA+ attribute prediction on the I600 dataset. The ALCOA+ principles were calculated by the prediction model at time intervals of 5 batches. The values reported refer to the performance of the models obtained on the test set. The rows with the "difference" header refer to the percent difference of the corresponding model when compared against ALCOAi.
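The "difference" rows can be read as a relative increase in MAE with respect to ALCOAi. The exact formula is not given in the text, so the sketch below assumes the common definition 100 · (MAE_model − MAE_ALCOAi) / MAE_ALCOAi; the MAE values are invented for illustration and are not results from the tables.

```python
def percent_difference(model_mae, reference_mae):
    """Relative increase of `model_mae` over `reference_mae`, in percent."""
    return 100.0 * (model_mae - reference_mae) / reference_mae


# Invented MAE values at two forecast horizons (not taken from the tables).
baseline_mae = [0.030, 0.045]   # e.g. a recurrent baseline
reference_mae = [0.020, 0.030]  # e.g. the reference model

differences = [percent_difference(m, r)
               for m, r in zip(baseline_mae, reference_mae)]
```

Under this definition, a value above 100 means that the baseline error is more than twice the reference error at that horizon.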

Conclusions
In this paper, we have presented two neural network models for regressing and predicting ALCOA+ values based on a set of parameters relevant to pharmaceutical manufacturing lines. These models were trained on two separate real datasets stemming from two production lines, achieving a high accuracy level and demonstrating the effectiveness of using neural networks for this task. In fact, the conducted experiments indicate that predicting ALCOA+ values is possible and that the predictions made can contribute to decision making to deal with data integrity issues according to the business policies of the pharmaceutical manufacturers. The experiments also showed that transformer-based networks perform better when compared to traditional time-series processing neural networks such as the LSTM and GRU, and that the performance gap increases as predictions are made further forward in time.
As future work, we aim to integrate explainable AI methods (such as [32]) to weight and highlight the data fields that contribute the most to the pharmaceutical manufacturing data integrity assessment based on the ALCOA+ principles.

Figure 1. The pipeline of the ALCOAi model for the regression of ALCOA+ values based on non-homogeneous data processing.

Figure 2. The model for the forecasting of future ALCOA+ values based on previous evaluations.

Figure 3. The ALCOA+ forecasting process based on the transformer [28]. The per-batch ALCOA+ values are sorted in a chronologically descending manner and fed to the transformer batch by batch. When the last (i.e., most recent) set of values is forwarded in the network, the output of the transformer reflects the predicted ALCOA+ values of the next batch. This output is then fed to the input of the transformer model to obtain the prediction of the next batch. This process can be repeated indefinitely.

Figure 4. By generating successive predictions based on the process shown in Figure 3, trends in the evolution of ALCOA+ values can be derived.

Figure 5. Performance evaluation of ALCOA+ principle prediction of the three models on the I600 dataset (left) and the I1000 dataset (right).

Table 1. Hyperparameters used during training of the models.

Table 2. Network hyperparameters identified as optimal for the LSTM, GRU, and ALCOAi models.

Table 3. Statistical information for each of the ALCOA+ principles used in the evaluation. The principles with an asterisk did not contain any meaningful data.

Table 4. Mean absolute error (MAE) between real values and predicted ones for ALCOA+ attribute regression. The ALCOA+ principles that were not used during the training sessions are omitted. The values reported refer to the performance of the models obtained on the test set. The best performance is written in bold.

Table 6. Mean absolute error (MAE) between real values and predicted ones for ALCOA+ attribute prediction on the I1000 dataset. The ALCOA+ principles were calculated by the prediction model at time intervals of 5 batches. The values reported refer to the performance of the models obtained on the test set. The rows with the "difference" header refer to the percent difference of the corresponding model when compared against ALCOAi.
Difference 42.54 63.15 33.44 30.29 42.84 70.69 68.06 90.78 135.82 164.61