1. Introduction
Over the last few decades, the main ambition of the pharmaceutical industry has been to provide optimum quality products that adhere to ever-updated good manufacturing practice (GMP) regulations, which must be applied throughout all stages of drug manufacturing [1]. Data integrity, which refers to the accuracy and consistency of data over its lifecycle, is an essential part of GMP regulations and an issue of paramount importance in any pharmaceutical industry quality system, which must ensure that medicines are of the required quality [2]. In particular, data integrity is critical in pharmaceutical manufacturing because it directly impacts patient safety and product quality. Pharmaceutical manufacturing processes are highly regulated, with strict guidelines set by regulatory agencies such as the FDA (Food and Drug Administration) [3] and the EMA (European Medicines Agency) [4]. These guidelines require pharmaceutical production companies to ensure the accuracy, completeness, and consistency of their data, including production and quality control data.
Ensuring data integrity is crucial for maintaining the quality of pharmaceutical products, as any errors or inconsistencies in data could result in drug product defects, contamination, or even harm to patients. For example, if production data are not properly recorded, it could lead to incorrect dosages or missing ingredients in a specific drug, which consequently could have serious consequences for patients taking this drug. In addition to patient safety, data integrity is also important for regulatory compliance. Regulatory agencies expect pharmaceutical companies to maintain accurate and complete product records for demonstrating that their products are safe, effective, and manufactured in compliance with applicable regulations. Ensuring data integrity requires a systematic combination of certain technical aspects, such as secure data storage and access controls, as well as organizational aspects, such as training policies and procedures. Furthermore, pharmaceutical production companies must conduct regular audits and reviews of their data to identify and resolve any problematic issues that may arise due to data integrity violations.
To this purpose, the ALCOA+ principles have been defined by the U.S. Food and Drug Administration to provide accurate, complete, consistent, enduring, and available data. These data qualities are indispensable for data and metadata integrity in pharmaceutical manufacturing [2]. Until recently [5], however, the ALCOA+ principles referred to non-quantifiable properties that data should possess, and evaluating whether data adhered to them was a purely qualitative task. Various data processing and management methods have so far been proposed aiming to ensure observance of the ALCOA+ data integrity rules, in line with current pharmaceutical industry regulatory frameworks [6]. Regulatory frameworks such as ALCOA+ generally include well-defined structured rules, reflecting applicable laws and regulations, imposed by international bodies. Failure to comply with the respective regulations can be disastrous for pharmaceutical manufacturing, as it may cause long delays and increased costs. In this respect, achieving data integrity by satisfying the ALCOA+ requirements is a fundamental step towards ensuring compliance with the respective pharmaceutical regulations, which in turn results in more efficient processing and increased product quality.
The ALCOA+ principles are the following [7]:
1. Attributable: All collected data should include information about the individuals who collected the data, the individuals who took action, and the time when the action was carried out.
2. Legible: Data must be accurate and comprehensible. Legibility encompasses more than just reading the written content; it also involves understanding the surrounding context of the information.
3. Contemporaneous: Data must be recorded promptly and accurately, by both individuals and systems. When dealing with electronic data, it is customary to timestamp the activities or actions to ensure a chronological order.
4. Original: Records must be kept in their original form rather than as duplicates or transcriptions, especially in manual record-keeping. The initial recording of the data, whether on paper or in a digital system, should serve as the primary record. In the case of digitally recorded data, it is crucial to have technical and procedural measures in place to prevent alterations to the original recording.
5. Accurate: All records must depict the actual events without any mistakes. Moreover, it is crucial to refrain from altering the original information in a manner that causes its loss (e.g., compression, rounding of numerical values, acronyms, etc.).
6. Complete: All captured information must possess a comprehensive audit trail to demonstrate the absence of deletions or losses. This requirement encompasses not only the primary data recording but also metadata, retest data, analysis data, and other related elements. Additionally, audit trails must be in place to track any modifications made to the data.
7. Consistent: Data timestamps must be sequential, in ascending order. Timestamps should also be added to indicate modifications made to the original data recording.
8. Enduring: Data must be stored long-term and remain readable and understandable for many years after their creation.
9. Available: Data must be accessible to every interested party at any time during their life cycle.
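Several of these principles lend themselves to simple automated checks over batch records. The following is a minimal sketch of such checks for the Attributable, Contemporaneous, and Consistent principles; the record layout and field names (`operator_id`, `timestamp`) are illustrative assumptions, not taken from any specific manufacturing execution system:

```python
from datetime import datetime

def check_attributable(record):
    # Attributable: the record must name who acted and when.
    # Field names here are hypothetical.
    return bool(record.get("operator_id")) and record.get("timestamp") is not None

def check_contemporaneous(events):
    # Contemporaneous: every recorded event must carry a timestamp.
    return all(e.get("timestamp") is not None for e in events)

def check_consistent(events):
    # Consistent: timestamps must appear in ascending order.
    stamps = [e["timestamp"] for e in events]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))

record = {"operator_id": "OP-07", "timestamp": datetime(2023, 5, 2, 9, 15)}
events = [
    {"timestamp": datetime(2023, 5, 2, 9, 15)},
    {"timestamp": datetime(2023, 5, 2, 9, 20)},
]
print(check_attributable(record))   # True
print(check_consistent(events))     # True
```

In practice such rule-based checks only cover the mechanical part of each principle; contextual aspects (e.g., whether a comment is understandable) require the learned models discussed later.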
Assessment of whether pharmaceutical process data comply with the ALCOA+ principles is currently a tedious and sometimes cumbersome task, involving largely manual investigation of lengthy reports and audit trails traced through general-purpose software. Although some guidelines for automating ALCOA+ assessment have been introduced [5], they still focus on identifying compliance violations only after they have taken place, thus failing to prevent the production delays and cost increases caused by faulty batches.
Furthermore, raw data in the pharmaceutical industry are often heterogeneous and come in vast amounts and various types, as generated from pharmaceutical plant data sources, making data integrity and data management challenging tasks. To make things worse, various state-of-the-art technologies in contemporary pharmaceutical industries, such as cyber-physical systems and process analytical technologies (PAT), have posed stringent demands for mass and smart manufacturing [8]. Given the sheer amount of available data and the extreme variety in data types and content, artificial intelligence techniques and, in particular, deep learning models provide efficient and adaptable solutions. Especially when compared against classical machine learning or statistical methods, deep learning techniques can substantially alleviate the burden of manually processing, transforming, and annotating large non-homogeneous datasets, while also supporting data integrity and ALCOA+ compliance [2]. Going one step further, such processed heterogeneous information, along with selected historical raw data and alarm signals, could be an adequate input for deep-learning-based prediction models and advanced data analytics. Such methods can derive reliable prognoses regarding ALCOA+ compliance and maintenance, as well as production line health conditions and production line maintenance. It is therefore imperative for any pharmaceutical manufacturing system to support ALCOA+ compliance prediction while also maintaining ALCOA+ standards and preserving a steady qualitative performance.
In this paper, we present ALCOAi, a transformer-based deep learning network for calculating and forecasting each of the ALCOA+ principles mentioned above. ALCOAi performs sophisticated processing and management of the continuously growing raw data coming from pharmaceutical manufacturing lines, in conjunction with the strict criteria and rules imposed by pharmaceutical regulatory frameworks, with the overall aim of maintaining ALCOA+ compliance. The model was tested on real pharmaceutical manufacturing line datasets and compared, in terms of prediction efficiency, against other prominent state-of-the-art time-series forecasting methods.
The main contributions of this work are the following:
- it describes an approach to assess compliance with the ALCOA+ principles;
- it describes a model to predict compliance with the ALCOA+ principles;
- it evaluates the AI models on a real dataset consisting of data stemming from two different pharmaceutical production lines.
The rest of this paper is structured as follows: Section 2 is dedicated to exploring related work, and Section 3 to the transformer-based network solution employed for resolving the problem of ALCOA+ prediction. Performance evaluation of the proposed model is included in Section 4, along with a discussion and analysis of the obtained results. Finally, conclusions are drawn and future directions are given.
2. Related Work
Time-series forecasting refers to the process of predicting future values of a time-dependent variable based on its historical behavior. In particular, time-series forecasting for data integrity is a niche research area that deals with predicting the behavior of data over time to maintain its accuracy and consistency. While there is not an extensive amount of literature on this specific topic, there are related works that address the broader themes of signal processing, time-series forecasting, and data integrity. Along this line, transformer-based AI models have shown significant improvements in various NLP tasks [9] and have also been successfully applied to time-series forecasting [10]. Various state-of-the-art machine learning techniques have been introduced and compared with respect to their performance and efficiency in data signal processing and data fusion [11], and their advantages and disadvantages have been discussed in detail [12]. Furthermore, advanced deep learning methods for efficiently managing big heterogeneous data have been proposed in [13]. In addition, data fusion and processing methods focused on IoT and sensor-based industrial environments and applications have been suggested and compared in [14,15,16]. Data processing methods have also been applied in the broader industry sector [17]. More specifically, representative advanced deep learning methods have been proposed in [18,19] for efficiently managing complex and big datasets in industrial applications.
As far as the pharmaceutical industry domain is concerned, various state-of-the-art AI and machine learning methodologies have been applied in drug discovery and development; representative approaches are described in [20,21,22,23,24]. Yet, the potential of adopting machine- and deep-learning-based methods for assuring ALCOA+ compliance and maintenance prediction in the pharmaceutical manufacturing sector has not been investigated so far. A prominent exception is an ALCOA+ compliance software tool [5] implementing an analytic model for quantifying the ALCOA+ principles. However, that tool lacks any ability to predict how ALCOA+ values will evolve over time.
3. Description of the ALCOAi Model
The ALCOA+ principles are of paramount importance in today’s pharmaceutical industry, as they emphasize describing and validating the drug creation process in an end-to-end manner. Since it is now possible to quantify the ALCOA+ principles, pharmaceutical manufacturing industries can define thresholds under which individual principles are considered violated. Violations of the ALCOA+ principles are a common occurrence, and pharmaceutical companies must dedicate sufficient resources to ensure ALCOA+ compliance.
While much of the data needed to assess ALCOA+ compliance is compiled automatically by the automated systems of the production lines, many types of data, and especially human-derived data, are often missing or only partially compiled, with a consequent decrease in data quality and integrity. Given the workload in pharmaceutical industries, operators often overlook compiling seemingly unimportant information, such as operator identification, or adding notes about crucial production events, with a consequent detrimental effect on ALCOA+ adherence.
For all these reasons, the research and development work performed in the context of the SPuMoNI European project (www.spumoni.eu) aims at maximizing the possibility of maintaining a high ALCOA+ compliance standard and, as a project result, a set of predictive AI-based models has been developed. These models monitor pharmaceutical production data continuously and make predictions about ALCOA+ principle value trends based on current and past data compilation behaviors. Their output can be used by pharmaceutical production managers as a decision support tool when deciding on measures for best data quality and integrity practices. For each lot of pharmaceuticals manufactured, i.e., for each production batch, the data and metadata generated during the manufacturing process are associated with a set of ALCOA+ values. The ensemble of these models is called ALCOAi and is composed of two main components:
- the ALCOAi regression model, which calculates ALCOA+ values based on the complete available data regarding each production batch;
- the ALCOAi forecasting network, which employs a transformer network capable of forecasting ALCOA+ principle values based on past batch production data and the corresponding ALCOA+ values.
More specifically, the general architecture of the ALCOAi model is shown in Figure 1. The model is an ensemble of deep learning networks tailored to processing different kinds of data. Operator comments (in the form of free text) are processed by a language modeling transformer [25], and raw sensor data streams are processed by neurons able to identify patterns in data sequences (GRUs [26]). Categorical and single numerical values are processed by linear neural networks. The output of each individual model is then concatenated and fed to a linear regressor network where the ALCOA+ values are computed.
The modularity of the approach permits adaptability to different datasets. In fact, should different data types be introduced in the datasets (e.g., images), a CNN feature extractor could be employed in the processing stage of the ALCOAi pipeline and concatenated to the input layer of the regression network.
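The branch-and-concatenate structure described above can be sketched as follows in PyTorch. This is a minimal illustration under assumed input sizes, not the paper's actual implementation: in particular, the language-model transformer branch is replaced here by a plain embedding with mean pooling for brevity, and all dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class ALCOAiRegressorSketch(nn.Module):
    """Illustrative ensemble: text branch + GRU sensor branch + linear
    branch, concatenated into a linear regressor over 9 ALCOA+ values."""
    def __init__(self, vocab=1000, text_dim=32, sensor_dim=4,
                 gru_hidden=16, num_feat=8, num_hidden=16, n_principles=9):
        super().__init__()
        # Stand-in for the language modeling transformer of the paper
        self.text_emb = nn.Embedding(vocab, text_dim)
        # GRU branch for raw sensor data streams
        self.gru = nn.GRU(sensor_dim, gru_hidden, batch_first=True)
        # Linear branch for categorical / single numerical values
        self.num_net = nn.Linear(num_feat, num_hidden)
        # Final linear regressor over the concatenated branch outputs
        self.regressor = nn.Linear(text_dim + gru_hidden + num_hidden,
                                   n_principles)

    def forward(self, text_ids, sensors, numeric):
        t = self.text_emb(text_ids).mean(dim=1)   # pooled text features
        _, h = self.gru(sensors)                  # final GRU hidden state
        n = torch.relu(self.num_net(numeric))
        fused = torch.cat([t, h[-1], n], dim=-1)  # concatenate branches
        return self.regressor(fused)              # one value per principle

model = ALCOAiRegressorSketch()
out = model(torch.randint(0, 1000, (2, 12)),  # 2 comment token sequences
            torch.randn(2, 50, 4),            # 50 sensor readings, 4 channels
            torch.randn(2, 8))                # 8 numeric features
print(out.shape)  # torch.Size([2, 9])
```

The design choice worth noting is that each branch reduces its modality to a fixed-size vector, so adding a new modality (e.g., a CNN for images) only widens the regressor's input.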
The network is trained to minimize the mean squared error loss function:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2,$$

where $\hat{y}_i$ is the estimated value of the $i$-th ALCOA+ principle, $y_i$ is the actual one, and $N$ is the number of ALCOA+ principles.
Transformers are a type of neural network architecture traditionally used for natural language processing tasks. However, they have recently gained popularity in time-series forecasting [10,27], where they can effectively capture complex patterns in time-series data as well as relationships with other types of variables.
The general architecture of a transformer consists of an encoder and a decoder, as depicted in Figure 2. The encoder takes as input a sequence of data and generates a representation of that sequence. The decoder then uses this representation to generate an output sequence. In time-series forecasting with external data, the input sequence is typically a window of historical time-series data along with the corresponding values of external variables.
The task of detecting when an ALCOA+ principle is going to be violated is posed, again, as a regression problem: given a set of input data, the model produces a set of ALCOA+ principle predictions for the next production batch. At each subsequent iteration, the output of the model from the previous iteration is concatenated to the input to produce predictions further in time (Figure 3).
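This iterative scheme amounts to an autoregressive loop: each prediction is appended to the history window before the next batch is predicted. A minimal sketch follows, where `predict_next` is a hypothetical stand-in for the trained forecasting network (here a toy decay rule, purely for illustration):

```python
def predict_next(history):
    # Toy stand-in for the trained model: the next ALCOA+ vector is a
    # mild decay of the most recent one. NOT the paper's actual model.
    return [0.99 * v for v in history[-1]]

def forecast(history, steps):
    """Autoregressively predict ALCOA+ vectors for `steps` future batches,
    feeding each prediction back into the input window."""
    window = list(history)
    predictions = []
    for _ in range(steps):
        nxt = predict_next(window)
        window.append(nxt)        # concatenate output back to the input
        predictions.append(nxt)
    return predictions

preds = forecast([[1.0, 0.9]], steps=3)
print(len(preds))  # 3
```

A known caveat of such loops is error accumulation: each prediction conditions the next, so forecasts further in time grow less reliable.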
Similar to the regression task, we employ the MSE loss, but this time the individual losses are also summed along the time axis, as in the following formula:

$$\mathcal{L} = \sum_{t=1}^{T} \sum_{i=1}^{N} \left(\hat{y}_{i,t} - y_{i,t}\right)^2,$$

where $T$ is the number of future predictions that we need, $\hat{y}_{i,t}$ is the predicted value of the $i$-th ALCOA+ principle at time $t$ (in batches), and $y_{i,t}$ is the actual value of the $i$-th ALCOA+ principle at time $t$.
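The double sum above can be written out directly. The following pure-Python sketch assumes the unweighted reduction shown in the formula (a framework implementation would use tensor operations instead):

```python
def forecasting_mse(pred, actual):
    """Sum of squared errors over T time steps and all principles.
    pred/actual: lists of T time steps, each a list of principle values."""
    return sum(
        (p - a) ** 2
        for pred_t, actual_t in zip(pred, actual)
        for p, a in zip(pred_t, actual_t)
    )

# T = 2 steps, 2 principles (for brevity; the paper uses 9)
pred = [[0.9, 0.8], [0.7, 0.6]]
actual = [[1.0, 0.8], [0.8, 0.5]]
print(forecasting_mse(pred, actual))  # ≈ 0.03
```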
By setting manual thresholds (according to individual internal business constraints) for each of the ALCOA+ principles, the ALCOAi model can detect whether and when the next principle violation will occur, in terms of batch productions, assuming all conditions remain as they are at the time of analysis. In this way, the model provides the probability of an ALCOA+ principle violation on a per-future-batch basis (e.g., the probability of violating an ALCOA+ principle at the next batch, the one after that, etc.), and, depending on the results of the analysis, an ALCOA+ trend curve can be constructed (Figure 4).
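The threshold check itself is straightforward once per-batch forecasts exist. A minimal sketch, with illustrative principle names and threshold values (in practice both come from internal business constraints):

```python
def first_violation(forecasts, thresholds):
    """Return (batch_index, violated_principles) for the first future
    batch whose forecast value falls below any threshold, or (None, [])
    if no violation is predicted within the forecast horizon."""
    for batch_idx, values in enumerate(forecasts, start=1):
        violated = [name for name, v in values.items()
                    if v < thresholds[name]]
        if violated:
            return batch_idx, violated
    return None, []

# Forecast ALCOA+ values for the next two batches (illustrative numbers)
forecasts = [
    {"Attributable": 0.95, "Complete": 0.90},
    {"Attributable": 0.92, "Complete": 0.84},
]
thresholds = {"Attributable": 0.90, "Complete": 0.85}
batch, names = first_violation(forecasts, thresholds)
print(batch, names)  # 2 ['Complete']
```

Plotting the per-batch forecast values against their thresholds yields exactly the ALCOA+ trend curve described above.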