A Topic Modeling Approach to Discover the Global and Local Subjects in Membrane Distillation Separation Process

Abstract: Membrane distillation (MD) is proposed as an environmentally friendly technology of emerging interest able to aid in the resolution of the worldwide water issue and brine processing by producing distilled water and treating high-saline solutions up to their saturation, with a view toward reaching zero liquid discharge (ZLD) at relatively low temperature requirements and a low operating hydrostatic pressure. Topic modeling (TM), which is a Machine Learning (ML) method combined with Natural Language Processing (NLP), is a customizable approach that is ideal for researching massive datasets with unknown themes. In this study, we used BERTopic, a new cutting-edge Python library for topic modeling, to explore the global and local themes in the MD separation literature. By using the BERTopic model, the words describing the collected dataset were detected together with over- and underexplored research topics to guide MD researchers in planning their future works. The results indicated that two global themes are widely discussed and are relevant to MD scientists worldwide. In brief, these themes cover permeate flux and heat-energy recovery on the one hand and surface modification and polyvinylidene fluoride hydrophobic membranes on the other. BERTopic discovered 62 local concepts. The most researched local topics were solar applications, membrane scaling, and electrospun membranes, while the least investigated were boron removal, dairy effluent applications, and nickel wastewater treatment. In addition, the topics were illustrated in a 2D plane to better understand the obtained results.


Introduction
The demand for fresh water continues to increase due to the rapid growth in the human population and other issues related to accelerated industrialization, environmental impacts, climate change, altered consumption patterns, etc., which lead to water stress and scarcity. Over the last century, the global water demand has expanded by 600%, and this rate equates to an annual increase of 1.8%. As a result, there is an urgent need for effective water management and conservation practices as well as for the development of creative, environmentally friendly methods and solutions to the global water crisis [1][2][3]. Membrane distillation (MD), a non-isothermal separation technology, has been proposed for clean water production and treatment of high-salinity waters up to their saturation, thus allowing the management of brines discharged from other water-processing plants and seeking to achieve zero liquid discharge, which represents an important environmental benefit. The driving force in MD is the water vapor pressure difference established at both sides of a hydrophobic porous membrane [4,5]. Among the various advantages of the MD separation process, one can highlight its potential to overcome the osmotic pressure [...]
Unsupervised learning methods are data-driven approaches that reveal internal data characteristics and laws through the learning of unlabeled data sets. These approaches excavate the internal features of the data in greater depth, making them more conducive to extracting discriminative features [33,34].
Many methods have been developed to find latent topics in a given dataset. Non-Negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), probabilistic LSA, and Latent Dirichlet Allocation (LDA) are among these techniques [35]. In 2022, a new state-of-the-art model, BERTopic, was introduced by Maarten Grootendorst [36] for topic modeling as a Python library. This generalized model for pretrained sentence transformers has yielded promising results for topic modeling in a variety of domains [37]. The BERTopic model assigns only one topic per document and aids in the identification of outlier documents that are difficult to classify, resulting in an improved classification accuracy [38]. The BERTopic model is based on six phases (the first five steps are mandatory, while the last step is optional). A related illustration can be seen in Figure 1 [39]. As can be seen in Figure 1, the first step of the procedure is document embedding, which is the core of the sequence in a text-based intelligent system. It is the process of converting a textual input into a numerical array (vector) form for applying ML models, and it uses structure-preserving maps to capture informative representations from high-dimensional observations [40][41][42]. BERTopic can use any pretrained transformer-based language model. The second step is a Dimensionality Reduction (DR) approach [37]. DR techniques are applied to reduce the number of input features in a set of data, which becomes more compact, improving the efficiency of the learning algorithm [43]. DR can help users reduce the data storage space, decrease the computational time of ML algorithms, and help visualize multidimensional data in lower dimensions such as 2D or 3D [44,45].
Many unsupervised Dimensionality Reduction methods, such as Multidimensional Scaling (MDS), Principal Component Analysis (PCA), Locally Linear Embedding (LLE), Isometric Mapping (ISOMAP), and Uniform Manifold Approximation and Projection (UMAP), have been proposed in the literature [46]. In the third step, the algorithm clusters the reduced embeddings [47]. Clustering in ML is a dynamic technique of categorizing data into numerous collections or clusters based on the similarities of the data points' characteristics and features [48,49]. Conventional clustering techniques are known as "unsupervised", indicating that no information about data point partitioning or outcome variables is available [50]. Clustering approaches are categorized into two types: hierarchical and partitional. Hierarchical clustering attempts to form a tree-like layout of classes and partitions the occurrences at each node of the tree, whereas partitional clustering assigns the occurrences to k clusters [51]. The algorithms used for clustering include k-means, spectral clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), and Ordering Points to Identify the Clustering Structure (OPTICS). These algorithms can detect the underlying structures in image, text, or video data [48,52]. Subsequently, in step 4, the BERTopic model tokenizes and vectorizes the documents. The quality of topic representations is critical in topic modeling for interpreting topics, communicating results, and understanding patterns.
It is critical to ensure that the topic representations are appropriate for a given case [39]. Tokenization is the process of breaking down text strings into small chunks such as words and phrases [53]. Because ML models only accept matrices as inputs, the unstructured data must be converted to vectors. The technique of translating text into numerical form is known as "text vectorization". Term Frequency-Inverse Document Frequency (TF-IDF), Doc2Vec, and CountVectorizer are commonly used vectorizers for textual data [54,55]. The fifth step is necessary to obtain an accurate representation of the discovered themes and to reflect the important words in the clusters. For this purpose, a class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) method is used. c-TF-IDF, a modified version of TF-IDF, considers what distinguishes documents in one cluster from documents in another. Finally, in the last step, the topic representations are fine-tuned. This optional step allows users to represent the concepts with more unique keywords. The BERTopic architecture offers a wide spectrum of options for fine-tuning that ranges from KeyBERT-like models to GPT-like models [39].
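The idea behind c-TF-IDF can be illustrated with a minimal sketch. It assumes the weighting usually stated for BERTopic, in which the frequency of a word inside a class is multiplied by log(1 + A/f), with A the average number of words per class and f the frequency of the word across all classes. The function name `c_tf_idf` and the toy classes are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score words per class; class_docs maps a class label to a list of token lists."""
    # Term frequency per class: all documents of a class are merged into one bag of words.
    tf = {c: Counter(tok for doc in docs for tok in doc) for c, docs in class_docs.items()}
    # Frequency of each word across all classes.
    f = Counter()
    for counts in tf.values():
        f.update(counts)
    # Average number of words per class.
    avg_words = sum(f.values()) / len(tf)
    return {
        c: {w: n * math.log(1 + avg_words / f[w]) for w, n in counts.items()}
        for c, counts in tf.items()
    }


# Toy data (hypothetical): two clusters of already tokenized abstracts.
classes = {
    "solar": [["solar", "membrane", "flux"], ["solar", "energy", "membrane"]],
    "fouling": [["scaling", "membrane", "fouling"], ["fouling", "flux"]],
}
scores = c_tf_idf(classes)
top_solar = max(scores["solar"], key=scores["solar"].get)
top_fouling = max(scores["fouling"], key=scores["fouling"].get)
```

In this toy case the word "membrane" occurs in both classes and is therefore down-weighted relative to the class-specific words, which is the behavior the text attributes to c-TF-IDF.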
MD research started in 1967 when the first paper, "Vaporization through Porous Membrane", was published by Findley, M. E. [56]. The field has a 56-year history with thousands of manuscripts written by researchers worldwide. This non-isothermal separation process has been discussed from numerous perspectives, such as laboratory experiments, the production of innovative membranes, system improvement, theoretical modeling, and optimization, among others. In our previous study, trending topics in the MD literature were identified via bibliometric methods, Text Mining approaches, and manual searching [14], but none of the published papers has revealed the research concepts in MD with a recently developed, state-of-the-art AI model. This paper employed the BERTopic algorithm to discover the most attractive and interesting MD research subjects based on the abstracts of the articles downloaded from the Scopus database. Several insights about MD can be revealed using the topic modeling approach: (i) the topics predominantly handled and those less discussed by MD researchers; (ii) information not only on MD topics but also on the prominent or lagging MD configurations, membrane types, and modeling approaches, among others, within any given topic (thanks to the list of topic terms, which enables in-depth exploration); and (iii) new perspectives or insights that might not have been noticed before, obtained by identifying the gaps in the MD literature. One can easily identify the relationships between different techniques and applications by looking at the BERTopic results. In addition, this kind of approach helps to guide scientists in carrying out pioneering and cutting-edge MD research studies. As a result, topic modeling is a valuable and effective tool for improving knowledge and innovation in MD.
This study also includes a detailed description and application of the BERTopic procedure, which we believe will inspire further studies on MD or in any other scientific domain.

Data
In this research, we used an MD dataset downloaded 23 January 2023 from the Scopus database. This database was chosen because it is well known in the academic community for its broad, inclusive, and comprehensive content coverage [57,58]. The search criteria and the keywords can be seen in Tables 1 and 2, respectively. In addition, those articles that were not found in the search results were manually added to the dataset. The collection was then manually screened to remove irrelevant documents and articles that did not have abstracts. The final dataset included 3684 articles.

Methods
The BERTopic architecture was used to reveal the hidden themes in the MD domain. BERTopic used 5 consecutive ML approaches to uncover the topics in the collection. The first approach was to transform the textual data into numerical representations (text embedding). The BERTopic architecture has a structure that allows many different pretrained embedding models. In this study, the selected embedding model was all-mpnet-base-v2 because it was the highest-quality model at the time of the research (i.e., the highest performance on sentence embedding for 14 different datasets with the highest average performance). all-mpnet-base-v2 is an all-around model optimized for a wide range of applications. Over 1 billion training pairs were used to train this model on a huge and diverse dataset. all-mpnet-base-v2 uses a mean pooling approach with normalized embeddings, and it converts a data instance into a 768-dimension (feature) numerical array [59].
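Mean pooling followed by normalization can be sketched in a few lines. The example below is a simplified illustration only: it uses hypothetical 4-dimensional token vectors instead of the 768 dimensions of all-mpnet-base-v2, and it ignores the padding mask that a real sentence-transformer pipeline applies before averaging.

```python
import math


def mean_pool(token_vectors):
    """Average the token vectors of one document, dimension by dimension."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Hypothetical token embeddings for a three-token document.
tokens = [
    [1.0, 2.0, 0.0, 1.0],
    [3.0, 0.0, 0.0, 1.0],
    [2.0, 1.0, 0.0, 1.0],
]
pooled = mean_pool(tokens)
embedding = l2_normalize(pooled)
```

Because the resulting embeddings have unit length, the dot product of two documents equals their cosine similarity, which is convenient for the similarity-based steps described later.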
As stated earlier, after the conversion of the dataset to numerical representations, the BERTopic library applied a DR technique. For this step, UMAP was used. This is a non-linear method that reduces dimensionality using manifold learning and topological data analysis. When reducing dimensionality, UMAP preserves the data's local and global structure, which is critical for capturing the semantics of textual data. The data are compressed to low dimensions (mostly 2D or 3D) by attempting to minimize the cross-entropy (CE). The basic calculation of CE is given by Equation (1) [60], where $A$ is the weighted adjacency matrix derived from $z$, and $\mu$ and $\upsilon$ represent two types of probabilities. Here, $z = \{z_1, z_2, \ldots, z_n\}$, $z_n \in \mathbb{R}^M$, is the lower-dimensional representation of the high-dimensional dataset $x$. The details of the UMAP calculations can be found elsewhere (McInnes et al. [61]).
The data were then fed to a clustering algorithm for segmentation. The HDBSCAN method was used for the data clustering. HDBSCAN is an updated version of DBSCAN that runs over varying epsilon values (ε) and integrates the results to identify the clustering that is most stable across epsilon. This enables HDBSCAN to detect clusters of various densities and to be more resilient in parameter selection. The HDBSCAN clustering algorithm exhibits important advantages over other clustering algorithms, as it produces a separate cluster for outliers, reduces the amount of noise in the clusters, and determines the number of clusters automatically [62]. The computational path of HDBSCAN starts with the two hyperparameters that the algorithm needs for clustering: ε, the distance scale; and k, the minimum number of points. Let $X = \{X_1, \ldots, X_n\}$ be a set of points in a metric space with the Euclidean distance d, and let $X_i \in X$. The point $X_i$ is a core point with respect to ε and k if the number of points within its ε-neighborhood is at least equal to k, as indicated in Equation (2) [63,64], where B denotes the open ball of radius ε.
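For the UMAP step and the core-point definition described above, the forms below are the ones usually given in the UMAP and HDBSCAN literature. They are offered as a sketch in standard notation rather than as the exact expressions of the cited references: $a$ ranges over the entries of $A$, and reading $\mu$ as the membership probability in the high-dimensional space and $\upsilon$ as the one in the low-dimensional space is an assumption, since the text only calls them two types of probabilities.

```latex
% Fuzzy-set cross-entropy minimized by UMAP (standard form, cf. Equation (1))
CE = \sum_{a \in A} \left[ \mu(a)\,\log\frac{\mu(a)}{\upsilon(a)}
     + \bigl(1 - \mu(a)\bigr)\,\log\frac{1 - \mu(a)}{1 - \upsilon(a)} \right]

% Core-point condition with respect to \varepsilon and k (standard form, cf. Equation (2))
\bigl| B(X_i, \varepsilon) \cap X \bigr| \geq k
```

The first term of the cross-entropy rewards keeping neighboring points close in the reduced space, while the second term penalizes placing unrelated points close together.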
In addition, two arbitrary points $X_i$ and $X_j$ are considered ε-reachable, depending on ε and k (Equations (3) and (4)). A cluster is formed when all the data instances in it are connected. The HDBSCAN algorithm uses a modified distance metric. The core distance $\kappa(X_i)$ is the distance of $X_i$ to its $k$-th nearest neighbor, and the mutual reachability distance between $X_i$ and $X_j$, $d_{\mathrm{mreach}}(X_i, X_j)$, is described by Equation (5). The mutual reachability distance moves the detected outliers further away from the clusters. By applying the traditional Single Linkage Clustering algorithm to the discrete metric space, the hierarchical clustering of X is established. A clustering on density variation can be used to discover regions with the highest density inside a point cloud, where the local density at each point is calculated by estimating the core-distance value associated with each point (Equation (6)). The cluster tree's hierarchy can be reduced by recursively merging some of the clusters. The cluster tree is condensed by setting a minimum permissible cluster size (m) and accepting the split of a cluster only when an increase in the λ value divides it into at least two subsets with sizes greater than m. According to this method, the stability (σ) of an individual cluster is defined by adding the range of λ values for each point in the cluster, as written in Equation (7), where $\lambda_{\max,C_i}(X_j)$ and $\lambda_{\min,C_i}(X_j)$ are the bounds of λ over the point $X_j$ in the cluster $C_i$. To obtain the best clustering attribution among all conceivable clustering results, the overall persistence score in all selected clusters should be maximized while considering the constraint of no cluster overlap. Clusters with the highest total persistence were chosen for this purpose, as indicated by Equation (8), where I is the subset of the total number of clusters (n).
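The core distance, the mutual reachability distance, and the cluster stability described above can be sketched as follows. The sketch assumes the commonly used definitions, namely that the mutual reachability distance is the maximum of the two core distances and the plain distance, and that the stability is the sum of (λ_max − λ_min) over the points of a cluster. Whether a point counts as its own neighbor differs between implementations; here it is excluded. All function names and the toy points are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (the point itself excluded)."""
    dists = sorted(euclidean(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]


def mutual_reachability(points, i, j, k):
    """Largest of the two core distances and the direct distance between the points."""
    return max(
        core_distance(points, i, k),
        core_distance(points, j, k),
        euclidean(points[i], points[j]),
    )


def cluster_stability(lambda_bounds):
    """Sum of (lambda_max - lambda_min) over the points of one cluster."""
    return sum(lam_max - lam_min for lam_min, lam_max in lambda_bounds)


# Three nearby points and one isolated point (toy data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
d_dense = mutual_reachability(points, 0, 1, k=2)
d_outlier = mutual_reachability(points, 0, 3, k=2)
```

For the two neighboring points the mutual reachability distance exceeds their plain distance of 1, because the core distance of the second point dominates; the isolated point stays far from every other point.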
For all $i, j \in I$ with $i \neq j$, Equation (8) is constrained by Equation (9).
Separations 2023, 10, 482
In the next step, the data were tokenized and vectorized. BERTopic uses CountVectorizer as the default for these purposes. First, each textual data instance is tokenized, and then the text is converted into features (F). If there are D documents and F features, CountVectorizer converts them into a D × F matrix. The values in the matrix represent the frequency of each feature [65].
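The D × F count matrix can be illustrated with a small pure-Python sketch. It only mimics the idea of a count vectorizer: the regular expression used for tokenization and the names `tokenize` and `count_matrix` are assumptions made for this example and do not reproduce the default behavior of any particular library.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_matrix(documents):
    """Return the sorted feature list and a D x F matrix of token counts."""
    tokenized = [tokenize(doc) for doc in documents]
    features = sorted({tok for doc in tokenized for tok in doc})
    index = {feat: j for j, feat in enumerate(features)}
    matrix = [[0] * len(features) for _ in documents]
    for i, doc in enumerate(tokenized):
        for tok, n in Counter(doc).items():
            matrix[i][index[tok]] = n
    return features, matrix


# Two toy documents (hypothetical), giving a 2 x 6 matrix.
docs = ["Membrane distillation flux", "Flux and membrane wetting; membrane scaling"]
features, X = count_matrix(docs)
```

Each row corresponds to one document and each column to one feature, so the entry for "membrane" in the second row is 2.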
In the fifth step, the BERTopic method allowed the representation of the topics of the clusters based on the c-TF-IDF approach. For a term x within class c, the c-TF-IDF score $W_{x,c}$ is calculated using Equation (10) [36], where $tf_{x,c}$ is the frequency of the word x in a class c, $f_x$ is the frequency of the word x across all classes, and A is the average number of words per class. In the last step, which is optional, the researcher can apply a technique called Maximal Marginal Relevance (MMR) to reduce the word repetition and increase keyword diversity (fine-tuning of topic representations). MMR, a good approach to present non-redundant information, considers the similarity of keywords to the document as well as their similarity to previously picked phrases. MMR is defined by Equation (11) [66], where D represents the sentence collection, F is the feature set, and S is the subset of sentences in D that have already been chosen. In D, $R \setminus S$ is the set of unselected sentences, and λ is the diversification constant, which takes a value between 0 and 1. $\mathrm{Sim}_1$ measures the similarity between a sentence and a feature, whereas $\mathrm{Sim}_2$ measures the similarity between two phrases. Apart from the 6 steps mentioned above, BERTopic can create different plots to interpret the obtained results. The most important of these plots is the heatmap of the topics' similarity. The heatmap created by the model is based on cosine similarity (CS). The basic idea behind cosine similarity is to compute the cosine value of the angle between two vectors to demonstrate their similarity. CS ranges between −1 and 1. The cosine similarity value is equal to 1 when two vectors point in the same direction and −1 when they point in opposite directions. For two vectors $X_V = \{x_1, \ldots, x_n\}$ and $Y_V = \{y_1, \ldots, y_n\}$, the CS is computed using Equation (12) [67].
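Cosine similarity and the MMR selection can be sketched together. The sketch assumes the usual definitions: cosine similarity is the dot product divided by the product of the vector norms, and MMR greedily picks the candidate that maximizes λ times its similarity to the topic minus (1 − λ) times its highest similarity to the candidates already chosen. The function names, the 2-dimensional word vectors, and the topic vector are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def mmr(topic_vec, candidates, top_n, lam=0.5):
    """Greedy Maximal Marginal Relevance; candidates maps a word to its vector."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        best_word, best_score = None, -math.inf
        for word in remaining:
            relevance = cosine_similarity(candidates[word], topic_vec)
            # Highest similarity to an already selected word (0 if none selected yet).
            redundancy = max(
                (cosine_similarity(candidates[word], candidates[s]) for s in selected),
                default=0.0,
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
        remaining.remove(best_word)
    return selected


# Toy word vectors: "membrane" and "membranes" are near-duplicates.
topic = [1.0, 1.0]
words = {"membrane": [1.0, 0.9], "membranes": [1.0, 0.95], "solar": [0.3, 1.0]}
diverse = mmr(topic, words, top_n=2, lam=0.5)
relevance_only = mmr(topic, words, top_n=2, lam=1.0)
```

With λ = 1 only relevance counts and both near-duplicate words are returned, whereas λ = 0.5 replaces the second of them with the less redundant word, which is the diversification effect described in the text.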

Outline of the MD Dataset
Before diving into the MD topic modeling results, it is critical to provide key information about the dataset, since this helps to comprehend its nature and quality. MD separation is a promising non-isothermal technology in the field of desalination and water treatment. It is a subject of increasing interest worldwide for numerous research groups. The annual publication count can provide an indication of the degree of research activity in a specific collection and can be used to track its growth and advancement over time. The evidence of this rising attention is shown in Figure 2.
As shown in Figure 2, the number of MD publications has increased exponentially, particularly since 2012. A total of 473 articles were published in 2022, the last year of the dataset. The total number of articles (3684) indicates that membranologists are making active efforts to produce appropriate MD membranes with improved performance and high thermal efficiency to optimize MD systems. However, as the number of publications increases, the question regarding which topics are being covered more and which are being ignored becomes more important. For this reason, it is extremely informative to determine those MD topics of great interest for future studies using the BERTopic architecture.
The distribution of some significant metrics of the domain was also examined. The advantage of using a violin plot to visualize the distribution of numerical values is that it can help to summarize the basic statistics as well as to show the density of each variable. Figure 3 illustrates the violin plots of the MD publication year, citation, page count, and reference count values of the domain; the main statistical values (min, max, mean, median, outliers, first quartile, and third quartile) are presented. The wide areas in the figures reflect more frequent data points, while the thin sections represent the less frequent data points.
[...] increase in MD studies was initiated. The mean (2015.85) and median (2018) values of the publication years are also evidence of this interest. When the dataset was divided in half using the median value, the number of studies conducted before 2018 was equal to the number of studies conducted during the last 4 years. When the number of citations received by MD articles was analyzed (Figure 3b), it could be seen that an article received 33 citations on average, which is reasonably good. There are articles in the collection that had not been cited at all (i.e., zero citations), but it should be noted that this was to be expected for those articles published in the last months of 2022.
It is clear that the distribution in Figure 3b is right-skewed (i.e., right-tailed), indicating that there were articles with more than 150 citations, and these publications were seen as outliers (i.e., very impactful papers with more citations than expected) in the dataset. Figure 3c shows the violin plot of the page counts of the articles in the MD domain. In Figure 3c, it can be seen that the articles with more than 30 pages protrude to the right in the graph. While most of the publications contained between 8 and 12 pages, the average value was approximately 11. Finally, when the violin plot of the reference count was examined (Figure 3d), the data were more evenly distributed than the other features, as shown by the width of the green region. While the average reference count was approximately 42, the median value (40) indicated that the number of articles using more than 40 references was equal to the number of articles using fewer than 40 references.

Terms Defining the MD Domain
In the BERTopic model, analysts can find the themes that define global topics and indicate specialized themes (i.e., local topics). The most important parameter in the BERTopic architecture for creating this distinction is min_cluster_size (in the HDBSCAN algorithm), which is the primary parameter that affects the resulting clustering. It sets the smallest grouping (i.e., the minimum number of documents in a group) that researchers want to consider a cluster. Increasing this parameter results in fewer but larger clusters, whilst reducing this value results in more microclusters [68,69].
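How min_cluster_size enters a BERTopic pipeline can be sketched as below. This is an illustrative configuration rather than the exact setup of the study: apart from the all-mpnet-base-v2 embedding model, UMAP, HDBSCAN, CountVectorizer, and MMR named in the text, every hyperparameter value (n_components, metric, random_state, stop_words, diversity) is an assumption, and `build_topic_model` and `abstracts` are hypothetical names. The third-party imports are placed inside the function so that the packages are only required when the function is called.

```python
def build_topic_model(min_cluster_size):
    """Assemble a BERTopic pipeline whose granularity is set by min_cluster_size."""
    # Third-party imports are deferred so the definition itself has no dependencies.
    from bertopic import BERTopic
    from bertopic.representation import MaximalMarginalRelevance
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        # Step 1: document embedding with the model named in the text.
        embedding_model=SentenceTransformer("all-mpnet-base-v2"),
        # Step 2: dimensionality reduction (values are illustrative assumptions).
        umap_model=UMAP(n_components=5, metric="cosine", random_state=42),
        # Step 3: clustering; larger values give fewer, larger topics.
        hdbscan_model=HDBSCAN(min_cluster_size=min_cluster_size, prediction_data=True),
        # Step 4: tokenization and count vectorization.
        vectorizer_model=CountVectorizer(stop_words="english"),
        # Step 6 (optional): fine-tuning of the topic words with MMR.
        representation_model=MaximalMarginalRelevance(diversity=0.3),
    )


# Hypothetical usage, with `abstracts` a list of abstract strings:
# topic_model = build_topic_model(min_cluster_size=50)
# topics, probabilities = topic_model.fit_transform(abstracts)
# print(topic_model.get_topic_info())
```

Calling the function with a large value yields a few broad topics, while a small value yields many specialized ones, which mirrors the global and local distinction described above. Step 5 (c-TF-IDF) is applied by BERTopic by default and is therefore not configured explicitly here.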
To find the words that describe the MD domain, the min_cluster_size parameter was set to 3684, which naturally resulted in one cluster. This cluster was described with the words listed in Table 3, which also define the collection. Furthermore, the dataset was illustrated in a two-dimensional space (Figure 4). Table 3 shows the terms most used by the MD community (i.e., the words describing the published research studies) and the related improvement efforts to make this technology a worldwide separation process. The term "dimension" in Figure 4 denotes a feature that describes an aspect of the data. Dimensions 1 and 2 are the two coordinates obtained when the 768 dimensions created in the first step of the algorithm (text embedding) with the all-mpnet-base-v2 model are reduced to two in the second step with the UMAP algorithm. Researchers involved in MD are already familiar with the terms in Table 3. A typical MD system consists of a high-temperature feed channel and a low-temperature permeate channel in which the vapor flux is driven by a temperature difference across a hydrophobic and microporous membrane. The vapor is collected on the permeate side of the membrane. The permeate flux is an essential indicator of a given MD system's performance, since it reveals how efficiently the system produces distilled water [70,71]. In addition to the very high rejection factor, MD researchers pay special attention to the permeate flux enhancement. In a typical MD configuration, the aqueous solution to be treated is heated on the feed side; the permeate (i.e., distillate) is then collected either at the permeate side of the membrane inside the membrane module (e.g., DCMD), on a condensation cold plate (e.g., AGMD), or outside the membrane module (e.g., VMD, SGMD), while the non-volatile components are retained by the membrane [72].
MD research studies are mostly conducted for the desalination of seawater or brackish water, although other aqueous feed solutions such as pharmaceutical and radioactive solutions have also been considered [14]. The flow rate, chemical composition, temperature, and other properties of feed solutions are the most studied parameters in MD because of their important effects on MD performance [73,74]. In fact, the temperatures of both the feed and permeate solutions are the main parameters affecting MD performance, especially the MD permeate flux and the thermal efficiency [75]. The rate of water vapor transport through the MD membrane is directly related to the transmembrane temperature difference, since the water vapor partial pressure difference (the driving force) is caused by the temperature difference at both sides of the membrane [76]. In addition, this driving force is also affected by both the temperature and feed concentration polarization effects, which are important phenomena that negatively affect the performance of MD systems. The induced concentration and temperature boundary layers at both membrane surfaces reduce the water vapor transfer by decreasing the temperature difference between the two sides of the membrane and increase the energy consumption [77]. The fact that MD scientists frequently employ the word "performance" indicates that their main goal is to improve it, as the term refers to both the permeate flux and the rejection factor. The improvement in MD performance over the years is also an indicator of the considerable efforts made to render this technology one of the leading membrane technologies for water production in the near future.
As can be seen in Figure 4, the documents occupied a wide area of the two-dimensional plane, and the presence of different clusters in the form of islets was evident. Even this illustration allows researchers to infer manually that the domain may contain quite a number of topics.

Global MD Subjects
In a second analysis, the min_cluster_size parameter was set to 1000, and the global concepts and their descriptive words were depicted. The resulting number of topics was two. The top 10 words for the topics and the distribution of the global themes in a 2D plane can be seen in Table 4 and Figure S1, respectively. Note that the HDBSCAN approach typically creates outliers (data points that do not fit into any topic). HDBSCAN leaves such points unassigned because forcing outliers into a cluster reduces the intracluster homogeneity and consistency (the BERTopic architecture aggregates the outliers together as Cluster -1). The outlier cluster is also specified in Table 4 and highlighted in light grey in Figure S1.
Table 4. Global topics in MD domain (min_cluster_size = 1000).

| Topic No | Number of Papers | Topic Name |
| --- | --- | --- |
| T-1 (Outliers) | 410 | membrane-water-distillation-process-concentration-membranes-flux-temperature-feed-using |
| T1 | 2121 | membrane-water-distillation-feed-md-flux-temperature-process-heat-energy |
| T2 | 1153 | membrane-membranes-surface-water-distillation-flux-PVDF-contact-MD-hydrophobic |

In Table 4, one can readily identify the global subjects (i.e., the main research topics) of MD. The MD studies in Cluster 1 (T1) included efforts to reach global goals and solve the main problems of MD. The articles in this cluster (2121) reflect the main objectives of MD researchers: to increase the permeate flux, reduce the energy requirements, and prevent the temperature polarization effect. These T1 topics were already discussed above in relation to Table 3. Although the topics in the T2 set (1153 articles) were addressed less often than those in T1, they were still a basic subject of MD studies. Membranes are an essential part of MD systems; thus, membrane engineering is a hot topic. MD membranes can be classified according to the membrane material (e.g., polymer or ceramic), membrane preparation technique (phase inversion or electrospinning), polymer type (polyvinylidene fluoride (PVDF), polypropylene (PP), poly(vinylidene fluoride-hexafluoropropylene) (PVDF-HFP), or polytetrafluoroethylene (PTFE)), and membrane type (flat sheet, hollow fiber, or nanofiber). Various researchers have carried out progressive studies on the preparation and modification of membranes specifically for MD [78][79][80][81][82]. The membrane surface modification topic is indicated by the word "surface" in the T2 cluster. Surface modification is an effective method used to customize the surface of membranes for a specific application, to increase the MD performance, to minimize the wetting of membrane pores, and to reduce fouling problems, among other uses [83,84]. Thanks to the studies performed on membrane surface modification, MD applications have been extended to many wastewater types [85][86][87][88].
The term PVDF was included in the T2 cluster since this polymer, which is formed by -(CH₂CF₂)ₙ- repeating units, exhibits excellent thermal stability, good processability, a high degree of mechanical strength, and robust chemical resistance, among other properties. In addition, PVDF can be dissolved in a variety of solvents, including N,N-dimethyl acetamide (DMAc), dimethyl formamide (DMF), and N-methyl-2-pyrrolidone (NMP) [89]. The word "contact" that appeared in T2 could refer both to the contact angle of the membrane surface and to the direct contact membrane distillation (DCMD) configuration. The contact angle (θ) is a macroscopic expression of the complex interaction between a liquid and a solid surface that can provide information about the hydrophobic character and wetting of the membrane surface along with its chemistry and topography [90][91][92]. In this context, the water contact angle of the prepared membranes also sheds light on their rejection factor (i.e., performance). Membrane pore wetting in MD results in a decrease in the produced water quality, affecting the overall long-term stability of the membrane and its lifetime. As for DCMD, it is the most used MD configuration [14,72,93,94]. Although MD exhibits more selectivity than other membrane separation processes, the wetting phenomenon is one of the drawbacks hindering the potential industrial implementation of MD technology [95,96]. Therefore, considerable effort is devoted to preparing hydrophobic or super-hydrophobic membranes, and this is the reason why the term "hydrophobic" appeared in T2. In general, as can be understood from the words appearing in the T2 cluster (such as surface, PVDF, hydrophobic, and contact), it can be confirmed that the second main hot topic of MD was membrane engineering.

Local MD Subjects
To find local clusters in the domain, the min_cluster_size value was set to 10, which meant that a cluster was created if 10 or more articles addressed the same topic. With this value, we aimed to keep clusters containing very few articles reasonably stable and to make the results easier to interpret. The number of clusters created was 63. Note that 1173 documents were marked as outliers and did not belong to any cluster. Before proceeding to the presentation and interpretation of the generated themes, the BERTopic model provided the opportunity to fine-tune the results by revealing the similarity between concepts. The similarity matrix can be seen in Figure 5. As can be seen in Figure 5, there was a high similarity between Topic 5 (electrospun-nanofibrous-nanofiber-membranes) and Topic 14 (superhydrophobic-nanofibrous-electrospun-electrospinning), with a value of ~0.76. Therefore, the results were adjusted by combining the 5th and 14th topics, and the number of clusters was reduced to 62. The resulting topics with their descriptive terms and the distribution of the topics in a 2D plane can be seen in Table 5 and Figure 6, respectively.
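The similarity-based adjustment described above can be illustrated with a short sketch. It assumes that each topic is represented by a numeric vector and that the similarity is the cosine similarity between these vectors. The toy vectors, the helper names, and the 0.75 threshold are hypothetical choices for illustration only; the text reports the ~0.76 similarity of the merged pair but no explicit threshold.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_candidates(topic_vectors, threshold):
    """Return (topic_a, topic_b, similarity) for pairs at or above the threshold."""
    pairs = []
    for a, b in combinations(sorted(topic_vectors), 2):
        similarity = cosine_similarity(topic_vectors[a], topic_vectors[b])
        if similarity >= threshold:
            pairs.append((a, b, similarity))
    # Most similar pair first
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical three-dimensional topic vectors (real topic vectors are much longer)
toy_vectors = {
    5: [1.0, 0.0, 0.0],
    14: [0.8, 0.6, 0.0],
    20: [0.0, 0.0, 1.0],
}
candidates = merge_candidates(toy_vectors, threshold=0.75)
```

In this toy case only topics 5 and 14 exceed the threshold. In the BERTopic library, a pair selected in this way can then be combined with the merge_topics method of a fitted model.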

Table 5. Local topics in the MD domain (min_cluster_size = 10).

| Topic No | Number of Papers | Topic Name |
| --- | --- | --- |
| T1 | 224 | solar-energy-water-desalination-production-collector-thermal-unit-plant-collectors |
| T2 | 175 | scaling-crystallization-brine-crystals-scale-MD-membrane-gypsum-recovery-RO |
| T3 | 159 | electrospun-nanofibrous-nanofiber-electrospinning-membranes-ENMs-superhydrophobic-layer-membrane-fabricated |
| … | … | … |
| … | 28 | fiber-water-hollow-desalination-feed-module-model-flow-rate-temperature |
| … | … | … |
| T51 | 15 | wetting-detection-pore-intrusion-wetted-liquid-pressure-sucrose-distillate-Tf |
| … | … | … |

As indicated in Table 5, there were 62 specific topics in the MD literature. Although it was not possible to interpret each topic individually in the present study, the most prominent local topic was solar applications (224 articles). Since MD is a thermally driven technology, interest in adopting solar-powered MD systems for desalination is expanding globally. Different types of solar systems have been successfully combined with MD, including heating with flat-plate solar collectors, heating with evacuated-tube solar collectors, heating with solar concentrators, powering with a solar pond, and photothermal collectors, among others [1,97,98]. Scaling, which is a phenomenon in which crystallization and/or precipitation of soluble salts occurs on the membrane surface [99], formed the T2 topic. Some ions in feed solutions, such as calcium and magnesium, may undergo chemical reactions to create carbonates or hydroxides, which then induce membrane scaling [100]. During a long-term MD operation, these scalants may obstruct membrane pores and eventually induce wetting, reducing the permeate flux and rejection factor as a consequence. Surface and bulk crystallization are the two processes through which mineral scalants deposit and develop on membrane surfaces [101]. Efforts have been made to overcome this important problem (175 articles). The third topic in Table 5 is nanofiber membrane fabrication via the electrospinning technique (159 articles).
Electrospun nanofibrous membranes (ENMs) exhibit various advantages compared to phase-inversion membranes, such as their very high void volume fraction, high surface-to-volume ratio, hydrophobic character, and energy efficiency, among others [102,103]. Heat and mass transfer in MD are two important mechanisms affecting the produced vapor flux and thermal efficiency. Both occur simultaneously in MD systems [74,104–108]. Since DCMD is the most commonly used configuration, it is expected that heat and mass transfer mechanisms are mostly investigated for this MD variant. In Table 5, it can be seen that 119 articles were grouped in the T5 topic that included AGMD, which was the second most used MD configuration, since condensation is carried out inside membrane modules over a condensing cold surface. However, due to the localized air resistance between the membrane and the condensing surface, the resulting AGMD permeate vapor flux is often minimal, although the proposed module designs with heat recovery allow a high energy efficiency [109]. The low heat transfer via conduction through the membrane, following Fourier's law, results in a low conductive heat loss through the membrane and a high thermal efficiency [110]. Polymeric hollow-fiber membranes have been used widely in most MD separation applications [111] because of their higher mechanical stability and packing density [112]. This is why this type of membrane was included in the sixth topic, with 90 articles. The two major methods considered for hollow-fiber membrane preparation were non-solvent-induced phase separation (NIPS) and thermally induced phase separation (TIPS).
Because hollow-fiber membranes exhibit some unique advantages, including their self-supporting nature (i.e., they do not require any support to withstand operation conditions) and the variety of possible arrangements in modules to achieve a high packing density and optimal fluid dynamics, reducing both the temperature and the concentration polarization effects, they attract much interest in the MD research field [113]. In general, Table 5 provides very useful information related to hot MD topics. This information includes polymer types (PVDF, PP, PS, and PTFE), polymer additives (carbon nanotubes (CNTs) and surface-modifying macromolecules (SMMs)), MD configurations (VMD and SGMD), fields of application (urine, leachate, and arsenic removal), module geometry (spacer and channel), theoretical modeling (Stefan-Maxwell and CFD), artificial intelligence applications (Neural Networks and modeling), and hybridization with other separation systems (MBR and FO) investigated in MD. With the information obtained from the BERTopic modeling, MD researchers can become aware of the current status of the subject on which they intend to work. In addition, the local topics identified through the BERTopic approach and listed at the end of Table 5 can be developed in the future. Unexplored subjects should be tackled. Identifying the MD research gaps will further boost MD knowledge and advance MD technology, steering researchers away from over-concentrated topics. Among the research topics registered with few published articles are heavy metals, toxic gases, and acids in wastewaters, which are major problems in industries. These topics offer great opportunities to promote and prove the versatility of MD for its industrial implementation. Topic modeling also revealed the methods applied in research development (membrane type, configuration, modeling approach, etc.). For instance, DCMD is predominantly used in olive oil, polyphenols, and olive mill wastewater processes (T55), while other MD configurations were not identified.
In the T58 topic, the nanofiltration (NF) separation process was also involved in Li⁺ recovery from brines. In this case, could a combination with other, more effective membrane separation processes such as reverse osmosis (RO), forward osmosis (FO), or electrodialysis (ED) result in a better treatment efficiency or a lower energy consumption?

Another point to be mentioned regarding Figure 6 and Table 5 is that 2511 papers were assigned to topics, but 1173 papers were highlighted as outliers and did not belong to any topic. The BERTopic architecture contains methods for reducing the number of outliers. The first approach is to adjust the min_samples parameter in HDBSCAN. In this study, this value was automatically set to the same value as the min_cluster_size. In addition, the reduce_outliers function in the BERTopic algorithm attempts to reduce the outliers by forcing them into a cluster. If an analyst would like to allocate every data instance to a cluster (generating no outliers), k-means can be used instead of HDBSCAN in the third step. In this study, we optimized the number of clusters. However, additionally forcing the outliers into a cluster was a poor option, since it changes the underlying structure of the data and decreases the clustering quality. Assigning outliers to a cluster can alter the cluster's center, shape, and size. In addition, decreasing the similarity within the cluster while increasing the similarity across clusters could lead to misleading or erroneous topics. Another approach may be to reduce the min_cluster_size to a value lower than 10. However, the issue in this last case is that tiny subjects are created that are so irrelevant that they can be ignored. Furthermore, a higher number of clusters could result in more complicated and less informative graphs. As a consequence, the min_cluster_size was kept at its optimal value. The outliers were defined with the following words: membrane-water-flux-distillation-membranes-md-feed-process-temperature-heat. These terms were similar to those in Table 3 that defined the collection but with a minor change: the term "heat" appeared instead of "performance". This indicated that the 62 local topics covered performance more thoroughly and largely ignored heat in MD applications.
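The effect of forcing an outlier into a cluster can be made concrete with a small numerical sketch. The two-dimensional points below are hypothetical and only illustrate the argument; they are not taken from the MD dataset, and the helper names are assumptions.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def mean_distance_to_centroid(points):
    """Average Euclidean distance of the points to their centroid (cluster spread)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# A compact hypothetical cluster and one distant outlier
cluster = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
outlier = (11.0, 1.0)
forced = cluster + [outlier]

spread_before = mean_distance_to_centroid(cluster)
spread_after = mean_distance_to_centroid(forced)
center_shift = math.dist(centroid(cluster), centroid(forced))
```

Adding the single distant point moves the cluster center and increases the average distance of the members to that center, which corresponds to the loss of within-cluster similarity discussed above.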
Each theme encountered in the dataset was represented by a set of words, but not all these words represented the topic equally. With the help of a bar chart, the importance of the words based on c-TF-IDF score was visualized. The topic term scores can be seen in Figure S2 for each concept.
Naturally, the first word of each topic had the highest value in terms of representing the concept. The highest term score belonged to the word "boron" in the 60th cluster (~0.177), which was about 3 times more impactful than the other words in the same topic and alone could express the definition of the topic. Again, the term "urine", with a c-TF-IDF score of ~0.169, could by itself identify what Topic 49 dealt with. Clusters 58, 42, 37, and 17 were examples of themes in which the importance of a single word was high in defining these topics. However, as shown in Figure S2, there were cases in which all words in a topic had an equal effect in defining that concept. For example, it was observed that the c-TF-IDF scores of all 10 words in Topic 25 were close to each other (between ~0.015 and ~0.023). Again, in Topic 24, while the lowest term score was ~0.019, the highest term score was not far from this value (~0.024). The 4th, 38th, and 3rd topics can be given as examples of such clusters. As can be seen in the term-score bar graph (Figure S2), the choice of the quantity of the words that are needed to represent a theme is important in topic modeling approaches. Since the dataset in this study contained results in which both a single word could describe the topic and the effect of all terms in the cluster was equal, it was understood that keeping the number of descriptive terms high was more useful.
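To make the term scores more tangible, the following sketch computes a class-based TF-IDF score for toy topics. It assumes the commonly cited c-TF-IDF form, in which the frequency of a term within a topic (normalized by the topic length) is multiplied by log(1 + A/f), where A is the average number of words per topic and f is the frequency of the term across all topics. The two toy topics and the function name are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(class_tokens):
    """class_tokens maps a topic id to the list of tokens of all its documents."""
    counts = {topic: Counter(tokens) for topic, tokens in class_tokens.items()}

    # Frequency of each term across all topics
    total = Counter()
    for counter in counts.values():
        total.update(counter)

    # Average number of words per topic
    avg_words = sum(len(tokens) for tokens in class_tokens.values()) / len(class_tokens)

    scores = {}
    for topic, counter in counts.items():
        n_words = sum(counter.values())
        scores[topic] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counter.items()
        }
    return scores


# Hypothetical toy topics, each given as one concatenated token list
toy_topics = {
    "boron": "boron boron boron removal membrane".split(),
    "urine": "urine urine membrane water recovery".split(),
}
scores = c_tf_idf(toy_topics)
top_term = {topic: max(s, key=s.get) for topic, s in scores.items()}
```

A term that is frequent in one topic and rare elsewhere dominates the score of that topic, whereas a term shared by all topics (here "membrane") is weighted down.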
Topics over time is a statistical procedure applied to identify how a given subject in a set of documents evolves with time. It is a useful approach that helps to understand the evolution of ideas, trends, and interests. Figure S3 shows the evolution (topics over time) of the 62 local topics detected in the MD domain. The most notable situation shown in Figure S3 is that some subjects were studied with greater momentum in recent years, while others were not. By employing a simple linear regression, it was found that the top five subjects with the highest slope were T1, T2, T3, T4, and T5, with values of ~0.339, ~0.254, ~0.260, 0.171, and ~0.170, respectively. These topics were not only the most researched; interest in them is also increasing every year. The topics with the lowest slope were T44, T53, T57, T59, and T61, with values of ~0.010, ~0.008, ~0.009, ~0.015, and ~0.014, respectively. This result showed that the popularity of these last five topics has remained stable and that they have been studied by a limited number of research groups over the years. T1, T6, and T23 were the earliest topics, investigated in 1967, when MD was first introduced.
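The slope comparison can be sketched with an ordinary least-squares fit. The example below assumes that the regression relates the yearly number of papers of a topic to the publication year; the text does not state the exact variables, and the yearly counts used here are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope: sum((x - mx) * (y - my)) / sum((x - mx) ** 2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical yearly paper counts for a growing and a stable topic
years = [2018, 2019, 2020, 2021, 2022]
growing_topic = [10, 14, 17, 22, 25]
stable_topic = [2, 1, 2, 1, 2]

slope_growing = ols_slope(years, growing_topic)
slope_stable = ols_slope(years, stable_topic)
```

A clearly positive slope indicates a topic that gains papers every year, while a slope near zero indicates a topic whose yearly output remains roughly constant.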
Considering the last 5 years may be a useful way to explore the distribution and density of the topics. T1, T2, T3, T4, and T9 were the most emphasized subjects, with 115, 89, 103, 57, and 58 reported studies, respectively. The topics T44 and T53 have been ignored, since only one study corresponded to each of them during the last 5 years. The number of topics with more than 25 papers published during the last 5 years was 16, while the number of topics with fewer than 10 papers published during the last 5 years was 23. Figure S3 also indicates that there were topics that exhibited similar patterns over time. In terms of similarity, the pairs T59-T60 and T48-T58 could be considered the closest to each other. This indicated that MD researchers sometimes show increasing or decreasing interest in different topics during the same period of time.
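One possible way to quantify such similar temporal patterns is the Pearson correlation between the yearly paper counts of two topics. This measure is an assumption made here for illustration, since the text does not specify how the similarity between time patterns was assessed, and the series below are hypothetical.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical yearly paper counts for three topics
series = {
    "A": [1, 2, 3, 2, 4],
    "B": [2, 4, 6, 4, 8],  # rises and falls together with A
    "C": [4, 3, 2, 3, 1],  # moves opposite to A
}

# Pair of topics whose time patterns are most strongly correlated
closest_pair = max(
    combinations(sorted(series), 2),
    key=lambda pair: pearson(series[pair[0]], series[pair[1]]),
)
```

Pairs with a correlation close to 1 rise and fall together over the years, which is the behavior described for the T59-T60 and T48-T58 pairs.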

Conclusions
Membrane distillation (MD) is a promising separation technology offering appropriate solutions to the worldwide water issue by treating saline waters. This membrane process has received substantial investigation, progressing from laboratory tests and novel membrane fabrication to systems development, theoretical modeling, and optimization. The present study used a state-of-the-art artificial intelligence approach, topic modeling, to uncover the main and refined topics in the MD literature. The topic modeling method can provide detailed insights into a corpus by finding the themes in a very short time with substantially less effort. In this study, a dataset that included 3684 articles downloaded from the Scopus database on 23 January 2023 was analyzed using the recently developed BERTopic architecture. According to the results, when there was only a single cluster, MD research was mainly defined with the following words: "membrane", "water", "distillation", "flux", "membranes", "MD", "feed", "process", "temperature", and "performance". When the min_cluster_size parameter was set to 1000, the dataset could be divided into two clusters. Thus, two global topics together with their descriptive words could also be revealed. In one cluster, the most globally researched MD topics (2121 documents) regarded feed wastewater content, permeate flux enhancement, temperature polarization, heat, and energy requirements. The other cluster (1153 studies) was mostly focused on surface modification, PVDF polymer applications, hydrophobic membrane production, and the direct contact membrane distillation (DCMD) configuration.
The local topics were also revealed in the present study. When the min_cluster_size parameter was set to 10 (i.e., to create a cluster if 10 or more articles addressed the same topic), 62 specialized topics were found to be investigated by researchers. Among these local topics, the most emphasized one was MD solar applications (224 research articles). The second most researched local topic regarded the membrane scaling problem and membrane crystallization applications, with 175 articles. Among the local topics, boron and boric acid removal via the air gap membrane distillation (AGMD) configuration (11 articles), lactose removal from dairy effluents (10 articles), and elimination of some heavy metals and sulfur from electroplating wastewater (10 articles) were the least covered themes. The method followed in this research study is intended to guide scientists working on MD by alerting them to the topics that have been overemphasized and/or ignored. The results of the present research may help MD researchers to assess the status of the topic on which they are working today and to choose the topic of their future MD scientific studies. It should be noted that since the number of scientific publications on MD is increasing exponentially every year, it is important to update the MD topic modeling analyses in the future.
Author Contributions: All authors contributed equally in terms of investigation, methodology, conceptualization, and validation; E.A., software; E.A., data curation; E.A., writing-original draft; M.K., writing-review and editing; E.A., visualization; M.K., supervision. All authors have read and agreed to the published version of the manuscript.