Schematics Retrieval Using Whole-Graph Embedding Similarity

This paper addresses the pressing environmental concern of plastic waste, particularly in the biopharmaceutical production sector, where single-use assemblies (SUAs) contribute significantly to the problem. To mitigate it, we propose an approach centered on the standardization and optimization of SUA drawings through digitization and structured representation. Leveraging the non-Euclidean properties of SUA drawings, we employ a graph-based representation and use graph convolutional networks (GCNs) to capture complex structural relationships. We introduce a novel weakly supervised method for the similarity-based retrieval of SUA graph networks that optimizes graph embeddings in a low-dimensional Euclidean space. Our method is effective at retrieving similar graphs that share the same functionality, offering a promising way to reduce plastic waste in pharmaceutical assembly processes.


Introduction
Annual plastic production is approximately 500 million tons, of which 40% is designated for single-use applications [1,2]. This extensive production, coupled with the slow degradation of plastic compounds, leads to significant environmental accumulation and raises serious concerns for ecosystems and human health [3]. As a result, managing plastic waste is becoming increasingly important to society.
The biopharmaceutical production sector contributes significantly to the quantity of plastic waste, a consequence of the pervasive use of single-use assemblies (SUAs) in pharmaceutical manufacturing. An SUA, as employed in biopharmaceutical production, consists of disposable equipment and components, such as containers, tubing, and filters, used for a single manufacturing process. Concerns persist about the environmental implications of SUA products relative to traditional biomanufacturing methods, primarily because of the considerable quantities of visible waste they generate. The use of SUAs in biopharmaceutical production is predominantly motivated by a reduced risk of cross-contamination, lower demands for cleaning and sterilization, greater process flexibility, and a reduction in plant footprint and capital investment [4].
Acknowledging the environmental impact of SUA plastic waste, manufacturers in the biopharmaceutical and pharmaceutical sectors actively contribute to sustainability strategies. This involves implementing new management approaches and designing products that are easily disassembled and recycled. This study approaches the issue from a different perspective, emphasizing human decision-making in the selection of the production pipeline. Because different pipeline designs or material selections can yield the same production outcome with varying environmental impact, our goal is to help designers and decision-makers choose designs with a lower plastic impact.
In pharmaceutical production, SUAs depend extensively on schematic drawings (SDs) for material selection and installation guidance. These schematics may vary for identical production processes, resulting in noticeable differences in their plastic consumption footprints. When two distinct entities, be they companies or pipeline designers, work on similar projects or on specific components of such a project, their pipeline designs may diverge, employing different SUA products and quantities. Classifying these design variations into distinct plastic impact classes gives system designers a systematic way to mitigate plastic waste during conceptualization and decision-making.
However, the manual analysis of technical drawings is a significant and time-consuming challenge, further complicated by the large volume of drawings in the domain and the absence of standardized data formats. There is therefore a need to standardize SUA drawings in a structured format and to store SUA information systematically in a database. Given the considerable volume of drawings across manufacturers and within each manufacturer's stock, it is crucial to develop an algorithm capable of measuring similarities among diverse designs. Such an algorithm is pivotal for a more efficient and sustainable SUA management process: it facilitates analysis and improves the efficiency and precision with which these technical schematics are evaluated, while mitigating environmental impact.
The structured digitization of SUAs substantially decreases the processing time compared with unstructured data. By eliminating the need for computer vision algorithms, it avoids the object detection and recognition challenges that arise when drawings are presented in image format. Hence, in partnership with international pharmaceutical companies, we publish our structured SUA diagram dataset derived from digitized SUA schematic drawings. Our proposed solution focuses on measuring similarities within structured SUA schematic drawings, aiming to assist designers and decision-makers in selecting pipelines with a minimized impact on plastic waste.

Related Work
Measuring similarities among drawings involves leveraging statistical properties inherent in the data. SUA drawings lack spatial constraints, so they are orderless and the same assembly can have diverse spatial representations. This characteristic arises because the underlying structure is embedded in the interconnectivity of the components rather than in their positions, which makes the data non-Euclidean in nature. Consequently, effectively analyzing diagrams for pattern similarity remains a challenge, as illustrated in Figure 1.
In the context of SUA analyses, the latent pattern of similarity is embedded within the functionalities of the pipeline. This pattern depends not only on the individual components comprising the assembly but also on the connectivity between them. The similarity within an SUA design is shaped by the relationships and interdependencies among components, which form cohesive clusters that contribute to the overall functionality of the assembly. Hence, we represent these assemblies as graphs, where nodes symbolize individual components and edges denote the interconnectivity between them. This graph-based representation emphasizes the complex relationships among the elements within an SUA and provides a holistic view, capturing the connections and dependencies that contribute to its overall functionality.
In this paper, we explore the challenge of similarity-based retrieval for SUA graph networks. The motivation behind this study lies in the difficulty of determining similarity, given that subtle differences can markedly influence the functionality of two SUA pipelines. Moreover, graphs with distinct structures can still manifest the same functionality, adding a further layer of complexity to the similarity assessment. As exemplified in Figure 1, the two assemblies are structurally dissimilar, yet both serve the identical functionality of a bioreactor pipeline. This example underscores the nuanced nature of similarity in SUA graph networks, where functional equivalence can coexist with structural diversity.
A graph network is a data structure widely employed in computer science and related fields that proves invaluable for addressing problems in social networks, molecular graph structures, and biological protein-protein networks. This data structure captures interactions between objects, prioritizing relationships over sole reliance on object properties. The representation of SUA drawings as graphs is a pivotal aspect of our proposed solution. This choice is driven by the assemblies' capacity to take arbitrary sizes and complex topological structures, all embedded with functional information derived from the overall structure of the assembly.
This approach offers several advantages. Firstly, graphs can capture the structural relationships and dependencies between different components in a diagram, providing a more intuitive and interpretable representation. This capability facilitates the identification of recurring subgraphs or patterns of similarity between diverse diagrams, improving the efficiency of data retrieval. Notably, this representation is well suited to heterogeneous data, such as the information commonly found in SUAs, allowing a wide range of information to be associated with each node and edge. Graphs also play a crucial role in data analysis and machine learning, where graph-structured data are used as feature information to make predictions or discover new patterns. Examples include classifying the role of a protein in a biological context [5], recommending new friends in a social network [6], and predicting the relationship between a molecule's structure and its odor [7].
To address the challenge of similarity-based retrieval for SUA graph networks, our methodology extracts meaningful information about both the structure and the attributes of the graphs. By focusing on both types of information, it aims to capture complex patterns within the graph networks, offering a comprehensive representation that goes beyond a direct measurement of structural elements. To represent this information in a measurable manner despite the non-Euclidean nature of the graph, we embed each entire graph into a feature vector in a low-dimensional Euclidean space $\mathbb{R}^d$. This transformation of complex graph structures and attributes into a Euclidean space makes it possible to quantify and compare complex patterns within networks.
Hence, our objective is to optimize the mapping function so that it accurately reflects the original structural similarities of graphs. This optimization ensures that similar graphs are positioned close to each other, while dissimilar graphs are situated farther apart, enhancing the fidelity of the representation in the low-dimensional Euclidean space. A further advantage of our proposed methodology is its scalability to an extensive repository of SUA drawings within a company's database, potentially reaching millions. Retrieval remains efficient because graph embeddings are precomputed and queried on the fly with nearest-neighbor methods. This scalability meets the demand for rapid access to and analysis of large-scale SUA datasets, underscoring the practicality of our proposed method for extensive industrial databases.
Traditional machine learning approaches often rely on handcrafted statistics to extract structural information, such as vertex degrees or clustering coefficients [8], or on graph kernel functions [9]. However, these methods are constrained by their reliance on manually crafted features, which makes them inflexible and limits their generalization across diverse tasks. Recent methods aim to learn graph representations automatically using a data-driven approach, deviating from earlier views that treated this problem merely as a pre-processing step. Nevertheless, these approaches, such as spectral clustering [10], node2vec [11], and DeepWalk [12], mainly focus on extracting structural information and neglect the attributes of nodes and edges. Furthermore, they are inherently transductive, lacking the capacity to generalize to unseen nodes and other graphs [13]. They are also space-inefficient, especially with large graphs, where learning a feature vector for each node becomes impractical [14].
Conversely, recent advances in deep learning have extended convolutional layers to graphs, substantially improving performance on node classification and edge prediction benchmarks [15,16]. The graph convolutional network (GCN) [17] introduced a simplified approximation to spectral convolution. Subsequently, the GraphSAGE network [18] expanded upon GCN's methodology by incorporating trainable aggregation functions, applied to sampled neighborhoods of varying depths to obtain node representations. The work [19] proposed the integration of masked self-attention layers to balance the influence of neighbors on node embeddings. Additionally, ref. [20] introduced the Message Passing Neural Network (MPNN), providing a differentiable approach to combining information from neighboring nodes.
The principal approach in graph networks involves training a model to generate individual node embeddings through a sequence of operations that transform and aggregate node features across the entire graph. The generated node embeddings then serve as input for differentiable prediction layers, enabling end-to-end model training. Our objective is to map entire SUA graph representations into a feature vector within a low-dimensional space. This mapping aims to measure similarities between SUA graphs by aggregating the embedded information of all node and edge features, including the topological characteristics embedded in their interconnectivity.
The conventional method in whole-graph embedding entails pooling all encoded features, achieved through simple summation or a neural network tailored for a set of operations, akin to CNN pooling layers for grid-like data. However, directly applying these pooling operations to graphs poses challenges. Furthermore, applying a global pooling operation to all embedded nodes, which reduces them to a single node, neglects the hierarchical structure inherent in the graph. Recently, top-k pooling [21] was introduced; it selectively propagates only a portion of the input graph while entirely disregarding the rest.
Numerous prior studies have tackled the concept of similarity in graphs. An initial approach, SIAMESE, outlined in [22], simplifies similarity modeling by aggregating node-node similarity scores. Another approach, presented as GCNMEAN and GCNMAX in [23], uses graph convolutional network (GCN) architectures alongside graph coarsening to produce graph-level embeddings for similarity assessment. Similarly, SIMGNN from [24] leverages node-node similarity scores by incorporating their histogram features, although its reliance on graph-level embeddings remains significant because the histogram function is non-differentiable. GMN, introduced by [25], integrates node-node similarity information into graph-level embeddings using a cross-graph attention mechanism. However, this mechanism solely updates node embeddings through cross-graph communication, resulting in one embedding per graph generated from the updated node embeddings.
Despite notable progress, prior research has predominantly concentrated on nodes and edges, with limited attention to the generalization of graph classification and pooling layers, particularly in the context of prominent supervised learning approaches. To address this gap, our work tackles similarity-based graph retrieval using weakly supervised methods, which have seen significant advances in the field of computer vision. The weakly supervised aspect arises from the absence of explicit supervision representing the exact numerical distance or similarity between SUA graphs. However, we do possess class attributes indicating the functionality of these SUA graphs. Training a model to classify SUA graphs on these class attributes is assumed to help the model extract representative features and separate the graphs' representations in its latent space. Consequently, the positions of the encoded graphs in the latent space are expected to indicate similarity and dissimilarity among graphs based on their features.
In summary, this paper contributes in three key ways:
1. We introduce a novel dataset consisting of structured single-use assembly (SUA) drawings in the pharmaceutical domain, serving as a pioneering resource in the field of graph embedding and graph similarity problems. This dataset is designed to inspire and support further research in these domains.
2. We showcase the efficacy of graph neural networks (GNNs) in generating graph embeddings that capture the functionality of SUA drawings within the framework of graph similarity.
3. We validate the effectiveness of training a weakly supervised model on the graph retrieval problem, demonstrating its capability in leveraging class attributes to extract representative features and discern similarities among SUA graphs.

Experiment
In this section, we present in depth our proposed framework for the retrieval of single-use assembly (SUA) schematics. Our methodology builds on a pioneering dataset, collaboratively curated with the pharmaceutical industry. This dataset is the first publicly available resource for SUA drawings represented in a graph network format.
The core objective of our proposed method is the transformation of structured SUA graphs into a reduced-dimensional Euclidean space. This conversion serves to quantify similarities among distinct SUA graphs accurately and to enable efficient retrieval based on their functional and structural attributes. The transformation maps the graphs from a complex, non-Euclidean space to a well-defined Euclidean space, making it easier to discern and comprehend the functionalities inherent in SUA graphs.
In the absence of explicit quantitative distance metrics between SUA graphs, our methodology relies on their functionalities to discern similarities. Therefore, the generated dataset is annotated with predefined classes associated with assembly functionality. This annotation allows us to develop our method in a weakly supervised paradigm, a well-established methodology proven effective in various computer vision applications. Within this paradigm, the model is trained to embed graphs into a low-dimensional space. These embeddings are then used for classification, so that the model categorizes the graphs in an end-to-end manner. This data-driven methodology enables the model to extract representative features from the graphs, encoding them with a discernible sense of similarity.

Dataset
In collaboration with international pharmaceutical experts specializing in SUA design, we curated a comprehensive dataset of synthesized assembly graphs, encompassing a variety of components such as bioreactors, tubing sets, connectors, filters, and other essential elements commonly found in pharmaceutical assembly processes. The dataset is a representative subset of the pharmaceutical pipelines frequently used in production. We have made the dataset (https://doi.org/10.5281/zenodo.10797234) available online to foster collaboration and to provide a resource that enables researchers to explore and advance solutions in the domain of pharmaceutical assembly processes.
The SUA pipelines were generated using knowledge rules supplied by experts in the pharmaceutical industrial sector. These rules outline the fundamental structural guidelines for each SUA category, delineating the conditions governing the appearance of specific components at each level of connectivity and their potential connections with components in subsequent levels. The data generation process produced digitized structured SUA data that resemble the information generated in the pharmaceutical domain using SUA digital drawing software. These data could potentially be formatted in XML, representing each component along with its connectivity to others.
The structured data, with their inherent relational nature, are transformed into a graph network. In this representation, nodes correspond to individual components, while edges symbolize the connections or relationships between these components, as illustrated in Figure 2. The resulting graph dataset consists of undirected and unweighted graphs, free from closed cycles. We employed the knowledge base rules provided by experts to generate all conceivable valid combinations of components, resulting in a large-scale dataset of assembly graphs. Out of the various classes provided by the experts, we chose to generate data for three specific classes: Bioreactor, Mixer, and Connection. This decision was made to manage the dataset's complexity and concentrate on distinct types of functional assemblies. The Bioreactor and Connection classes have the potential to generate numerous graphs, reaching into the millions, depending on the designated depth of connectivity. However, to manage computational resources, we constrained the generation to 5 levels. Specifically, we sampled 240,000 and 300,000 graphs for the Bioreactor and Connection classes, respectively, while the Mixer class was represented by 1100 graphs in our dataset.
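As a minimal sketch of this conversion, the snippet below parses a small XML description into node types and an undirected adjacency structure, then checks that the graph contains no cycle. The XML layout is an assumption made for illustration: the element names (`assembly`, `component`, `connection`), the attribute names, and the component letters other than `M` are hypothetical and are not taken from the dataset repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; element and attribute names are illustrative only.
EXAMPLE_XML = """
<assembly class="Mixer">
  <component id="0" type="M"/>
  <component id="1" type="T"/>
  <component id="2" type="F"/>
  <connection source="0" target="1"/>
  <connection source="1" target="2"/>
</assembly>
"""

def xml_to_graph(xml_text):
    """Return (node_types, adjacency) for an undirected, unweighted graph."""
    root = ET.fromstring(xml_text)
    node_types = {c.get("id"): c.get("type") for c in root.iter("component")}
    adjacency = {node_id: set() for node_id in node_types}
    for edge in root.iter("connection"):
        u, v = edge.get("source"), edge.get("target")
        adjacency[u].add(v)  # undirected: store both directions
        adjacency[v].add(u)
    return node_types, adjacency

def is_acyclic(adjacency):
    """A simple undirected graph is cycle-free iff edges == nodes - components."""
    n_edges = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    seen, n_components = set(), 0
    for start in adjacency:
        if start in seen:
            continue
        n_components += 1
        stack = [start]
        seen.add(start)
        while stack:  # depth-first traversal of one connected component
            node = stack.pop()
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return n_edges == len(adjacency) - n_components

node_types, adjacency = xml_to_graph(EXAMPLE_XML)
```

Storing each edge in both directions reflects the undirected, unweighted nature of the graphs; per-node attributes would be attached alongside `node_types` in a fuller implementation.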
The imbalanced distribution in the dataset originates from the distinct characteristics and functional variations among the three classes: Bioreactor, Mixer, and Connection. While the Bioreactor and Connection classes contain a substantial number of instances, the Mixer class is comparatively under-represented. This imbalance introduces challenges during learning, necessitating robust strategies to ensure that the model captures and generalizes the unique features of each class. At the same time, it offers researchers an opportunity to develop techniques for handling class imbalance in graph representation problems and for improving model performance across the classes in the dataset.
Figure 3 illustrates examples from each class. For clarity, the components used in graph generation are annotated with unique letters and color coded to ease visualization of the resulting graphs. Comprehensive details are available in the dataset repository. A brief overview of the three generated classes is provided below:
Bioreactor assembly: Graphs in this class may have up to 12 lines; lines 9, 10, 11, and 12 each have a 50% chance of appearing, while the remaining lines are always present in the graph. These graphs consist of a maximum of three levels, with Level 1 always starting with component (B), which can be connected to 12 lines.
Mixer assembly: Graphs in this class consist of up to 5 lines; lines 3 and 4 each have a 50% chance of appearing, while the other lines are always present. These graphs have a maximum of 3 levels, and Level 1 always starts with component (M), which can be connected to 5 lines.
Connection assembly: Graphs in this class exhibit a distinct structure compared to the previous two. The structure is not defined by the number of lines but rather by the possible connected nodes and their potential connections. These graphs can have up to 5 levels. Because of the vast number of potential graphs in this class, we randomly selected 1% of them during the Level 4 generation step.
The possible connectivity with respect to the remaining components depends on each component's properties and its potential connections. Detailed information is available in the dataset repository.
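The line-presence rule described for the Bioreactor and Mixer classes can be sketched as follows. This is only an illustration of the 50% rule for optional lines; it does not reproduce the expert knowledge rules that decide which components populate each line and level, and the names `CLASS_RULES` and `sample_lines` are hypothetical.

```python
import random

# Line counts and optional lines as described above for two of the classes.
CLASS_RULES = {
    "Bioreactor": {"n_lines": 12, "optional": {9, 10, 11, 12}},
    "Mixer": {"n_lines": 5, "optional": {3, 4}},
}

def sample_lines(assembly_class, rng):
    """Return the sorted line indices present in one sampled assembly."""
    rule = CLASS_RULES[assembly_class]
    present = []
    for line in range(1, rule["n_lines"] + 1):
        # Mandatory lines always appear; optional ones appear with probability 0.5.
        if line not in rule["optional"] or rng.random() < 0.5:
            present.append(line)
    return present

rng = random.Random(0)  # fixed seed so the sketch is reproducible
mixer_lines = sample_lines("Mixer", rng)
bioreactor_lines = sample_lines("Bioreactor", rng)
```

Under this rule a Mixer graph has between 3 and 5 lines and a Bioreactor graph between 8 and 12, which gives 4 and 16 distinct line configurations, respectively, before component choices are considered.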

Method
Pharmaceutical companies often maintain extensive databases of SUA drawings. Once these drawings are digitized into a structured data format, our proposed model maps the structured representations to embedded vectors in a Euclidean space. As a result, the database holds a condensed rendition of each SUA drawing in the form of a latent vector capturing the distinctive features of the graph structure and attributes. Given a graph $G_i = (V, E)$, expressed as a set of nodes $V$ and a set of edges $E$, our objective is to have a model represent it as a compressed vector $X_i$ in the latent space. In the context of SUA drawings, these are heterogeneous graphs, implying that each node $i \in V$ is associated with a feature vector $m_i$.
To extract the more complex structural features inherent in the SUA graph networks, we employed a graph convolutional network (GCN) architecture with a specific focus on GraphSAGE (Graph Sample and Aggregation) layers [18]. The GraphSAGE model is designed to operate in an inductive learning framework, making it particularly suitable for scenarios where the graph structures may evolve or new graphs are introduced.
The proposed graph convolutional network (GCN) is constructed with three GraphSAGE layers, each contributing to the hierarchical processing of information. In each layer, the model aggregates the features of a node's connected neighbors using one of several aggregation methods, including maximum, average, or a fully connected layer. The aggregation function, as represented in Equation (1), aggregates the embedding vectors $h_u$ of all nodes $u$ in the immediate neighborhood of the target node $v$, where $N(v)$ denotes the set of neighbors of $v$. This results in the aggregated representation $a_v$ for node $v$.
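For reference, the aggregation step of GraphSAGE [18] that matches this description can be written as below. The layer index $k-1$ on $h_u$ is an assumption that follows the notation of the update step, and the mean is shown only as one example of the aggregators listed above.

```latex
a_v = \mathrm{AGGREGATE}\left(\left\{ h_u^{k-1} : u \in N(v) \right\}\right),
\qquad \text{for example} \qquad
a_v = \frac{1}{|N(v)|} \sum_{u \in N(v)} h_u^{k-1} .
```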
Subsequently, the aggregated features of the node are transformed, updating the node's own features based on both its existing features and those derived from its connectivity. As illustrated in Equation (2), this function computes the updated representation of node $v$, denoted $h_v^{k}$, by combining its neighborhood's aggregated representation $a_v$ with the node's previous representation $h_v^{k-1}$. The parameter $k$ represents the stage or layer of the model. This iterative process gradually accumulates and integrates information from neighboring nodes, capturing the structural properties embedded within the graph. The use of GraphSAGE layers thus helps the model discern complex patterns and relationships within the graph network.
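The aggregate-then-update step can be sketched in NumPy as follows. This is a minimal illustration in the standard GraphSAGE form [18], where the update is $h_v^{k} = \sigma(W^{k} [h_v^{k-1} \,\|\, a_v])$; the ReLU nonlinearity, the concatenation, the random weights, and the function name `graphsage_layer` are assumptions for this sketch rather than details confirmed by the text.

```python
import numpy as np

def graphsage_layer(H, neighbors, W, aggregator="mean"):
    """One GraphSAGE-style layer.

    H         : (n_nodes, d_in) array of node representations h_v^{k-1}
    neighbors : list where neighbors[v] holds the indices of N(v)
    W         : (2 * d_in, d_out) weight matrix (learned in practice)
    """
    n_nodes, d_in = H.shape
    A = np.zeros((n_nodes, d_in))
    for v, nbrs in enumerate(neighbors):
        if not nbrs:
            continue  # isolated node: aggregated vector stays zero
        if aggregator == "mean":
            A[v] = H[nbrs].mean(axis=0)
        elif aggregator == "max":
            A[v] = H[nbrs].max(axis=0)
        else:
            raise ValueError(f"unknown aggregator: {aggregator}")
    # Update: concatenate own features with the aggregate, transform, apply ReLU.
    return np.maximum(np.concatenate([H, A], axis=1) @ W, 0.0)

# Toy path graph 0 - 1 - 2 with 2-dimensional node features.
H0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = [[1], [0, 2], [1]]
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # random stand-in for trained weights
H1 = graphsage_layer(H0, neighbors, W1)
```

Stacking three such calls, each with its own weight matrix, lets information from nodes up to three hops away reach every node representation.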
Our ultimate goal is to encode an entire graph into a single latent vector. Throughout the forward pass of our model, this graph representation encoding is achieved using top-k pooling, a method validated in the graph-based model Graph U-Net [21]. In the top-k pooling operation, applied during the layer-wise propagation of the network, the objective is to select the top k nodes of the graph based on their scalar projection values onto a projection vector $p'$. The operation ranks the nodes according to their scalar projection values, obtains the indices of the k largest values, and extracts the corresponding rows and columns from the adjacency matrix $A'$ and the feature matrix $X'$. Because the projection vector $p'$ is trainable, it adapts during backpropagation, which improves the flexibility of the pooling layer in capturing salient features of the graph structure.
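A minimal NumPy sketch of this pooling step is given below. It follows the formulation in Graph U-Net [21] as we understand it, including the sigmoid gate on the selected scores that keeps the projection vector in the gradient path; the function name `topk_pool` and the toy matrices are illustrative assumptions.

```python
import numpy as np

def topk_pool(X, A, p, k):
    """Top-k pooling in the style of Graph U-Net [21].

    X : (n, d) node feature matrix
    A : (n, n) adjacency matrix
    p : (d,) projection vector (trainable in the real model)
    k : number of nodes to keep
    """
    scores = X @ p / np.linalg.norm(p)         # scalar projection of each node onto p
    idx = np.argsort(scores)[::-1][:k]         # indices of the k largest scores
    gate = 1.0 / (1.0 + np.exp(-scores[idx]))  # sigmoid gate on the kept scores
    X_pooled = X[idx] * gate[:, None]          # gated features of the kept nodes
    A_pooled = A[np.ix_(idx, idx)]             # induced subgraph on the kept nodes
    return X_pooled, A_pooled, idx

# Toy graph with 4 nodes and 2 features per node.
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 0.0]])
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
p = np.array([1.0, 0.0])
X_pooled, A_pooled, idx = topk_pool(X, A, p, k=2)
```

With this projection vector the scores are 1, 0, 3, and 0, so nodes 2 and 0 are kept and the edge between them survives in the pooled adjacency matrix.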
To foster the interplay of information between nodes and layers, the pooled features from each layer were concatenated, so that information from every depth of the model reaches the final representation. The top-k pooling and concatenation steps were applied across the three layers of the model to produce a final encoded vector. The resulting fixed-length vector encapsulates a condensed yet comprehensive representation of the entire graph, supporting downstream tasks such as classification, similarity assessment, and retrieval.
To facilitate the downstream task of classification, which serves as a form of weak supervision for the graph retrieval problem, the embedded vector is further processed through two fully connected layers. This classification step encourages the model to encode graphs in the latent space in a manner that reflects their similarity or dissimilarity. Specifically, similar graphs are encouraged to be encoded close to each other in the latent space, while dissimilar graphs are pushed farther apart. The classification layers leverage the discriminative features captured during the initial encoding, allowing the model to distinguish and classify graphs based on their functional attributes and structural characteristics. This dual-stage process not only supports graph retrieval but also strengthens the model's ability to learn meaningful representations that align with the inherent properties of the SUA graphs.
Because the dataset classes are imbalanced, training on them may bias the learned parameters toward the classes with more instances, as these are more prevalent during training. The model may then under-weight or neglect the classes with fewer examples. Potential solutions include augmenting the under-sampled class or weighting the loss during model training. Given that we are working with a carefully crafted dataset following specific knowledge base rules, traditional data augmentation methods, such as randomly dropping or adding components or connections, are not applicable: they may introduce invalid data or create false graphs within the class structure. Consequently, we explored two alternative approaches. The first introduces weights to the training loss. The second groups the small class (Mixer) with the Bioreactor class, forming a binary classification task against the third class during model training, and employs a lightly weighted loss to stabilize the training.
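The two options can be sketched as follows, using the class sizes reported for the dataset. The inverse-frequency weighting scheme is an assumption made for illustration, since the exact weights used are not specified here; the function names are likewise hypothetical.

```python
import math

# Class sizes as reported for the dataset.
CLASS_COUNTS = {"Bioreactor": 240_000, "Connection": 300_000, "Mixer": 1_100}

def inverse_frequency_weights(counts):
    """Weight each class by total / (n_classes * count), so rare classes weigh more."""
    total = sum(counts.values())
    return {name: total / (len(counts) * n) for name, n in counts.items()}

def weighted_cross_entropy(probs, label, weights):
    """Weighted negative log-likelihood of the true class for one sample."""
    return -weights[label] * math.log(probs[label])

def merge_classes(counts, merged_name, members):
    """Second option: group several classes under one label."""
    merged = {k: v for k, v in counts.items() if k not in members}
    merged[merged_name] = sum(counts[m] for m in members)
    return merged

# Option 1: three-class task with a weighted loss.
weights = inverse_frequency_weights(CLASS_COUNTS)

# Option 2: Mixer grouped with Bioreactor, binary task against Connection.
binary_counts = merge_classes(CLASS_COUNTS, "Bioreactor+Mixer", {"Bioreactor", "Mixer"})
binary_weights = inverse_frequency_weights(binary_counts)
```

Under this scheme the three-class weights differ by more than two orders of magnitude, whereas the merged binary task yields two weights close to 1, which is consistent with needing only a lightly weighted loss in the second option.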
After the training phase has converged on the classification task, the model encodes the entire training dataset and the final vector representations are extracted. These vectors are stored as the elements of a dataset library, which is then used during the inference phase to identify similarities among items based on their encoded vector representations.
To confirm that the model-encoded graphs align with the notion of similarity relevant to the task, we visualized the embedded training graphs. The visualization uses Principal Component Analysis (PCA), a dimensionality reduction technique that maps high-dimensional data into a lower-dimensional space. As illustrated in Figure 4a, the model clustered the training graphs based on their functional and structural similarities, identifying subgroups within each class that share common functional features in the latent space. The visualization also shows that the model separates graphs belonging to different classes into distinct clusters.
It is noteworthy that two classes (Bioreactor and Mixer) were merged into one during training but were separated into distinct clusters in the latent space. Even though the model was trained not to distinguish between these two classes, it still separated them in the encoded latent space, which supports the assumption that it extracted distinctive features representing the functionality of the graphs. To further validate this, we projected the validation set of graphs using the trained graph network model and the PCA projection model, as shown in Figure 4b. The model grouped unseen graphs into clusters that resemble, and overlap with, those of the training set in the latent space.
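The projection step described above, in which PCA is fitted on the encoded training graphs and the same projection is then reused for the validation graphs, can be sketched as follows. This is a minimal sketch under stated assumptions: it uses power iteration with deflation written against the standard library only, whereas a library PCA routine would normally be used; the function names `fit_pca` and `transform` and the toy 3-dimensional vectors are hypothetical and stand in for the actual embeddings.

```python
import math

def fit_pca(rows, n_components=2, n_iter=200):
    """Estimate the mean and leading principal directions from training rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(n_components):
        v = [1.0 / math.sqrt(d)] * d  # deterministic start vector (assumption)
        for _step in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the variance captured along v.
        eigval = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
        components.append(v)
        # Deflate: remove the direction just found before searching for the next.
        cov = [[cov[i][j] - eigval * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mean, components

def transform(rows, mean, components):
    """Project rows onto the stored components, reusing the training mean."""
    d = len(mean)
    return [[sum((r[j] - mean[j]) * c[j] for j in range(d)) for c in components]
            for r in rows]

# Toy stand-ins for encoded training and validation graphs.
train = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
val = [[1.0, 2.0, 5.0]]

mean, components = fit_pca(train)           # fitted on training vectors only
val_2d = transform(val, mean, components)   # validation reuses the same projection
```

Fitting on the training vectors alone and reusing the stored mean and components keeps the validation points in the same coordinate system as the training clusters, which is what makes the two panels of such a plot comparable.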

Results
In this section, our objective is to quantitatively assess the similarity between graphs. Because the graph neural network maps SUA graphs from their intrinsic non-Euclidean domain into a Euclidean space, we chose the Euclidean distance as the similarity measure. This measure was applied between the encoded vector representation of a test graph and every encoded vector stored in the training graph database. The Euclidean distance captures the geometric relationships between vectors in the encoded space and thus allows an effective evaluation of graph similarities.
In our proposed method, the encoded graphs are represented by 32-dimensional vectors. Various distance functions, including Mean Square Error (MSE), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), or even cosine distance, can be employed to measure the similarity between these vectors in the embedding space. As part of our approach, we used the Principal Component Analysis (PCA) algorithm to project the encoded vectors, thereby further reducing their dimensionality. PCA identifies the principal components, or directions, along which the data exhibits the greatest variability. By projecting the vectors onto these principal components, we not only compressed the data to focus on the most relevant features but also aimed to make the distance calculation more efficient. Additionally, the reduced dimensionality yields a more compact representation of the stored graph database, lowering storage and computational requirements.
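The distance functions named above can be written out for two embedding vectors as in the following sketch. The function names and the example vectors are illustrative assumptions. Note that MSE and RMSE are monotonic functions of the Euclidean distance (for d-dimensional vectors, MSE is the squared Euclidean distance divided by d), so all three produce the same ranking of candidates, whereas MAE and cosine distance may rank them differently.

```python
import math

def euclidean(a, b):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mse(a, b):
    """Mean of squared coordinate differences: euclidean(a, b)**2 / d."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def rmse(a, b):
    """Square root of MSE: euclidean(a, b) / sqrt(d)."""
    return math.sqrt(mse(a, b))

def mae(a, b):
    """Mean of absolute coordinate differences (L1 distance divided by d)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cosine_distance(a, b):
    """1 - cosine similarity; undefined if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 2-dimensional vectors standing in for the 32-dimensional embeddings.
u, v = [3.0, 4.0], [0.0, 0.0]
d_l2, d_mse, d_rmse, d_mae = euclidean(u, v), mse(u, v), rmse(u, v), mae(u, v)
d_cos_orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
d_cos_parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])
```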
The Euclidean distance between vectors allowed us to quantitatively assess the similarity between graphs and identify the most closely corresponding pairs. A smaller distance between vectors indicates a higher degree of similarity, providing a quantitative measure of their functional resemblance. Alternative distance metrics, or dimensionality reduction techniques other than PCA, can also be explored for this purpose. The method not only identifies the closest graph but also generates a ranked list of possible candidates. This list serves as a valuable resource for users, offering insights into potentially similar functions among the saved SUA drawings. Such informed decision-making contributes to minimizing the environmental impact associated with the selection process, in line with sustainability goals.
During the testing phase, we utilized real-world data obtained from pharmaceutical companies, representing authentic SUA assemblies. These real-world SUA assemblies were not part of the model's training data and did not follow the same conceptual framework as the synthesized dataset used for training. The real-world data was encoded using the trained model and transformed into low-dimensional vectors using the PCA model. Subsequently, these encoded vectors were compared against the entire saved graph dataset.
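The comparison of an encoded test graph against the saved graph dataset amounts to a nearest-neighbor search, which can be sketched as below. The function name `retrieve_top_k`, the graph identifiers, and the 2-dimensional vectors are hypothetical placeholders; the sketch assumes the library is held in memory as a mapping from graph identifier to projected vector.

```python
import heapq
import math

def retrieve_top_k(query, library, k=3):
    """Return the k (distance, graph_id) pairs closest to the query vector.

    `library` maps a graph identifier to its stored embedding vector.
    """
    scored = ((math.dist(query, vec), gid) for gid, vec in library.items())
    return heapq.nsmallest(k, scored)  # sorted by increasing distance

# Hypothetical library of projected training-graph embeddings.
library = {
    "g1": [0.0, 0.0],
    "g2": [1.0, 0.0],
    "g3": [5.0, 5.0],
    "g4": [0.0, 2.0],
}
query = [0.0, 0.1]  # hypothetical projected embedding of a test graph

top3 = retrieve_top_k(query, library, k=3)
top3_ids = [gid for _, gid in top3]
```

A linear scan like this is adequate for a small library; the returned distances can be shown alongside the candidates so that a user can judge how close each match is.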
It is crucial to note that establishing a quantitative ground truth for graph similarity, with precise measures of similarity or dissimilarity, is inherently challenging and, in practice, not feasible. The only quantitative attribute available during training is the class assignment, which is absent from the testing data, because the testing data comprises graphs associated with different functionalities that are not preassigned to classes. To address this challenge, we asked pharmaceutical domain experts to judge whether the graph candidates retrieved by the model represented the most similar cases to a given input. The expert assessments served as a qualitative measure of similarity, providing valuable insights into the model's performance in capturing functional resemblances.
Qualitative analysis of the retrieved graphs showed that the model identified the top-ranked similar graphs, as illustrated in Figures 5 and 6. Here we showcase real-world data examples processed with the proposed model. We retrieved the top three similar graphs from the stored database using the Euclidean distance metric. Expert qualitative evaluation showed that graphs with similar functionality were projected close to their corresponding counterparts from the training set. This proximity indicates a match in their functional behavior, validating the model's ability to discern and retrieve functionally similar graphs. Notably, some of these nodes were never encountered during the training phase, which shows that the model can identify functionally similar graphs within the dataset even when presented with previously unseen data, and that it generalizes to graphs with similar structural and functional characteristics. The inductive learning approach of GraphSAGE is pivotal here, as it allows the model to generalize effectively to previously unseen SUA graphs. This is particularly beneficial in real-world applications where new assembly graphs may be introduced over time. By leveraging the inductive capabilities of the GraphSAGE model, our approach provides a robust representation of SUA graphs, enabling accurate similarity assessment and retrieval even as graph structures change and evolve.
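The inductive property mentioned above can be illustrated with a minimal GraphSAGE-style layer using mean aggregation. This is a sketch under stated assumptions and not the architecture used in this work: the single layer, the identity weight matrices, the mean-pooling readout that turns node vectors into one whole-graph vector, and the names `sage_mean_layer` and `graph_embedding` are all hypothetical. The point it shows is that the weights are shared across nodes and do not depend on graph size, so the same parameters can encode a graph that was never seen during training.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sage_mean_layer(features, neighbors, W_self, W_neigh):
    """One layer: ReLU(W_self * h_v + W_neigh * mean of neighbor features)."""
    d = len(next(iter(features.values())))
    out = {}
    for v, h in features.items():
        nbrs = neighbors.get(v, [])
        if nbrs:
            agg = [sum(features[u][j] for u in nbrs) / len(nbrs) for j in range(d)]
        else:
            agg = [0.0] * d  # isolated node: no neighbor information
        z = [a + b for a, b in zip(matvec(W_self, h), matvec(W_neigh, agg))]
        out[v] = [max(0.0, x) for x in z]
    return out

def graph_embedding(node_vectors):
    """Mean-pool node vectors into a single whole-graph vector (assumed readout)."""
    n = len(node_vectors)
    d = len(next(iter(node_vectors.values())))
    return [sum(h[j] for h in node_vectors.values()) / n for j in range(d)]

# Hypothetical shared weights (identity matrices, for readability only).
W_self = [[1.0, 0.0], [0.0, 1.0]]
W_neigh = [[1.0, 0.0], [0.0, 1.0]]

# Two toy graphs of different sizes with one-hot component-type features.
feats_a = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
nbrs_a = {"a": ["b"], "b": ["a"]}
feats_b = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
nbrs_b = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

emb_a = graph_embedding(sage_mean_layer(feats_a, nbrs_a, W_self, W_neigh))
emb_b = graph_embedding(sage_mean_layer(feats_b, nbrs_b, W_self, W_neigh))
```

Both graphs are encoded with the same two weight matrices even though they differ in node count, and both end up as vectors of the same length that can be compared directly.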

Conclusions
This project addresses a pressing environmental issue, namely, the generation of plastic waste in pharmaceutical assembly processes. By introducing a unique dataset specifically tailored for structured pharmaceutical single-use assembly (SUA) drawings, we pave the way for applying graph embedding techniques to this domain. The combination of graph convolutional networks with a weakly supervised learning approach efficiently encodes SUA graphs into a compact, low-dimensional space.
The proposed method is effective in retrieving the most similar SUA drawings, demonstrating its practical utility. Its successful performance on real-world pharmaceutical data underscores its potential for tangible implementation. This solution not only addresses the immediate challenge of plastic waste but also opens avenues for future researchers, who can build upon this foundation to optimize existing solutions or explore alternative encoding spaces, using this experiment as a baseline model for further advancements. The broader implication of this work is a step towards fostering sustainable practices in the pharmaceutical industry and contributing to ongoing efforts for a greener and environmentally responsible future.

Figure 1.
Two distinct SUA configurations (a,b) of a Bioreactor pipeline showcasing similar functionality, illustrating the complexity of structures, with the objective of quantifying similarities between them. Different components in the assembly are represented by letters and color codes. Letters are defined in the dataset description.

Figure 2.
An illustrative example depicting the conversion of an SUA pipeline drawing into a graph network. (a) Structured SUA drawing. (b) SUA graph. C: culture, P: pumping, W: storage, S: sampling, P: connecting, F: filtering, Y: mixing.

Figure 3.
Three graph samples, each from a different class, highlighting distinctions in both structure and functionality. Letters and color codes denote various components. (a) Bioreactor class. (b) Mixer class. (c) Connection class.

Figure 4.
PCA projection of the training set and validation set demonstrates the model's ability to encode representative functional features and cluster them based on their functional structure. The graph presents the two merged training classes instead of three. (a) PCA-projected encoded training set. (b) PCA-projected encoded validation set.

Figure 5.
The first graph represents the test input, while the subsequent three graphs are retrieved from the database. (a) Top 3 similar retrieved graphs to example CNC 0008. (b) Top 3 similar retrieved graphs to example CNC 0060. (c) Top 3 similar retrieved graphs to example CNC 0066.

Figure 6.
The first graph represents the test input, while the subsequent three graphs are retrieved from the database. (a) Top 3 similar retrieved graphs to example SMP 0002. (b) Top 3 similar retrieved graphs to example CNC 0016. (c) Top 3 similar retrieved graphs to example CNC 0024.