Identifying and Analyzing Topic Clusters in a Nutri-, Food-, and Diet-Proteomic Corpus Using Machine Reading

Nutrition affects the early stages of disease development, but the mechanisms remain poorly understood. High-throughput proteomic methods are being used to generate data and information on the effects of nutrients, foods, and diets on health and disease processes. In this report, a novel machine reading pipeline was used to identify all articles and abstracts on proteomics, diet, food, and nutrition in humans. The resulting proteomic corpus was further analyzed to produce seven clusters of “thematic” content defined as documents that have similar word content. Examples of publications from several of these clusters were then described in a similar way to a typical descriptive review.


Introduction
Nutrients and energy intake contribute to the initiation, progression, and outcome of multiple diseases [1]. However, many of the molecular mechanisms by which food components initiate diseases are not well defined, which hampers both the early detection of chronic diseases and the understanding of how nutrition influences these processes. Transcriptomic [2], metabolomic [3], and proteomic technologies (this review) are increasingly used to probe subtle changes in cells and molecules in blood that are affected by different diets, nutrients, and food components.
Blood is a key transport system that delivers metabolites to and from various organs and also serves as the disposal system for unused end products and cell debris, both of which can be assessed from a venous blood draw. Sampling other human tissues, such as adipose or muscle, is more invasive and poses more challenges for sample preparation. Identifying biomarkers of exposure (e.g., [4,5]) is a primary goal of many research studies, and results of omic analyses of blood components can be used to develop predictive models that may explain the variability of nutrition response (e.g., [6]). Compared to genomic and transcriptomic technologies, high-throughput proteomic methods have been the most challenging to develop because of the broad range in protein concentration (particularly in the blood) and the extensive variation in physicochemical properties of proteins. Nevertheless, advances in proteomic technologies such as antibody-based multiplexed proximity extension assays [7], mass spectroscopic instrumentation and workflows [8,9], and DNA aptamer technologies [10] now permit the analysis of several thousands of proteins simultaneously. The Human Protein Atlas lists 4072 proteins in plasma detectable by mass spectroscopy (https://www.proteinatlas.org/humanproteome/blood+protein/proteins+detected+in+ms (accessed on 18 September 2022)) and SomaLogic's Somascan™ platform can quantify over 7000 proteins (https://somalogic.com/specificity/ (accessed on 18 September 2022)).
Advances in proteomic technologies are increasingly being used to identify clinically and nutritionally relevant biomarkers such as receptors, enzymes, and transporters. A PubMed search for (i) proteomics and nutrition, (ii) proteomics and diet, and (iii) proteomics and food listed 4970, 2766, and 10,485 citations, respectively (as of 18 September 2022). Classical manual methods of reviewing this literature would necessarily require restricting the search to more specific targets, eliminating the possibility of a comprehensive survey of nutri-, food-, and diet-proteomics. An alternative approach is to use machine reading technology to extract and analyze the corpus of these topics.
We previously developed a natural language processing pipeline that parses, annotates, and analyzes ~37 M citations and publications in the National Library of Medicine (e.g., PubMed and PubMed Central) [11], as well as extracts semantically meaningful relationships between the labelled entities. The relation extractor module of the pipeline is built on a transformer-based language model [12] and fine-tuned to label the meaning and directionality of the extracted relationships. In this report, we used parts of the machine reading pipeline to identify, parse, and analyze all articles and abstracts on proteomics, diet, food, and nutrition in humans. The resulting proteomic corpus was analyzed to produce seven clusters of "thematic" content defined as documents that have similar word content. Examples of publications from several of these clusters are then described in a manner similar to a typical descriptive review. We propose that this machine-guided approach facilitates a more objective and systematic review of the articles in this domain.

Querying and Document Parsing
The machine reading pipeline was essentially as described in [13] (excluding the relation extraction module). Briefly, the initial step in the pipeline development was to parse ~33 M citations in PubMed and ~2.4 M full-text records and annotate/index the concepts of interest (Table 1). The NCBI E-utilities (e-utils) API service was used to fetch articles returned from the queries described in Table 1. The queries included both keywords and MeSH terms, so as not to rely exclusively on a set of query terms (as is the case with keyword search) while also ensuring inclusion of the latest articles (which are often missing from MeSH term search, due to the time lag in MeSH indexing). Further, within the MeSH term search, we queried using both the standard [MH] tag and the [MAJR] tag, the latter of which restricts results to articles in which the queried MeSH term is of primary importance. This allows separate assessment of the corpus sizes of these two searches. It is expected that the articles from the [MAJR] search are a strict subset of those from the [MH] search. The articles returned from the queries were parsed from XML to JSON using the PubMed Parser Python library, and then the abstract and full text were split into sentences using the scispaCy en_core_sci_sm sentencizer model.
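As an illustration of the query strategy, the following minimal Python sketch builds (but does not send) E-utilities esearch URLs for keyword, [MH], and [MAJR] variants of a query. The topic strings and the helper names build_term, esearch_url, and is_strict_subset are hypothetical placeholders for illustration; the queries actually used are those listed in Table 1.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Public E-utilities search endpoint; no request is sent in this sketch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(topic: str, field: str) -> str:
    """Pair 'proteomics' with one topic under a single PubMed field tag."""
    return f'"proteomics"[{field}] AND "{topic}"[{field}]'


def esearch_url(term: str, retmax: int = 100) -> str:
    """Return the esearch URL that would list PMIDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def is_strict_subset(majr_pmids: set[str], mh_pmids: set[str]) -> bool:
    """Check the expectation that [MAJR] hits are a strict subset of [MH] hits."""
    return majr_pmids < mh_pmids


# Placeholder topics and tags: TIAB = title/abstract keyword, MH = MeSH term,
# MAJR = MeSH major topic. The real queries are those listed in Table 1.
topics = ["nutrition", "diet", "food"]
fields = ["TIAB", "MH", "MAJR"]
queries = {(t, f): esearch_url(build_term(t, f)) for t in topics for f in fields}

print(queries[("diet", "MAJR")])
```

Once the PMID sets returned by the [MH] and [MAJR] variants of a query have been retrieved, is_strict_subset can be used to confirm the expected containment.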

Document Annotation
The proteomic-nutrition corpus was annotated using various approaches, depending on the entity type. The DNorm annotation tool from NCBI ([14] and https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/dnorm/ (accessed on 20 June 2020)), which provides integrated functionality for disease normalization to MeSH IDs, was used for disease annotations. The GNormPlus annotation tool from NCBI ([15] and https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/gnormplus/ (accessed on 20 June 2020)), which provides integrated functionality for gene normalization to NCBI Gene IDs, was used for gene annotations. These results are discussed below and provided in an interactive Supplementary File.

Co-Mention Analysis
An analysis was performed to identify sentences comentioning entity pairs of interest. Each comention relation is summarized in a network and/or table (examples below) with the following details: (i) the origin sentence and (ii) the tagged entities (proteins or diseases), labeled as V1 and V2 (vertex 1 and 2). Because comentions are nondirected, there is no semantic difference between V1 and V2. V1 and V2 are further described in terms of: (i) text_found, which is the exact text representing the given entity; (ii) preflabel for the given entity, which serves to collapse synonyms for a common entity (the preflabels, and not text_found, are the nodes in the network); and (iii) type of entity, which in this case is protein or disease. These results are discussed below with an interactive graph and a table of comention statements provided in the interactive Supplement Files SB.
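The comention record described above can be sketched in a few lines of Python. This is a simplified illustration and not the pipeline's actual code: the Entity class, the comentions and edge_key helpers, and the toy sentence are assumptions made for the example, with field names taken from the description (text_found, preflabel, type).

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Entity:
    text_found: str  # exact text span found in the sentence
    preflabel: str   # common label shared by all synonyms of the entity
    type: str        # "protein" or "disease"


def comentions(sentence: str, entities: list[Entity]) -> list[dict]:
    """Pair every tagged protein with every tagged disease in one sentence."""
    proteins = [e for e in entities if e.type == "protein"]
    diseases = [e for e in entities if e.type == "disease"]
    return [
        {"sentence": sentence, "V1": p, "V2": d}
        for p, d in product(proteins, diseases)
    ]


def edge_key(record: dict) -> frozenset:
    """Undirected network edge: order of V1 and V2 carries no meaning."""
    return frozenset((record["V1"].preflabel, record["V2"].preflabel))


# Toy sentence (invented for illustration) with two synonyms of one protein.
sentence = "Serum CRP (C-reactive protein) was elevated in obese participants."
tags = [
    Entity("CRP", "CRP", "protein"),
    Entity("C-reactive protein", "CRP", "protein"),
    Entity("obese", "Obesity", "disease"),
]
records = comentions(sentence, tags)
edges = {edge_key(r) for r in records}
print(len(records), "comention records,", len(edges), "network edge")
```

Because the network nodes are preflabels rather than text_found strings, the two synonym mentions in the toy sentence collapse into a single protein-disease edge.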

Document Clustering
To gain insight into the thematic content of the proteomic-nutrition corpus identified by the pipeline, we performed document clustering using the term frequency-inverse document frequency (tf-idf) metric. Briefly, this metric describes the importance of a given word in each (and every) document within the context of a larger corpus. tf-idf is highest when a word is common within a given document and rare in the rest of the corpus. Words with high tf-idf in a document are therefore loosely analogous to keywords. Each document was represented as a numeric vector of tf-idf scores, and we then used these vectors to cluster the documents into thematic groups with K-means clustering. For the K-means clustering, we used a plot of the within-cluster sum of squares to decide on a suitable value for K. T-Distributed Stochastic Neighbor Embedding (t-SNE) was used for dimensionality reduction [16] and visualization [17]. Tf-idf vectorization, K-means clustering, and t-SNE were all performed using the scikit-learn library for Python. An interactive version of the t-SNE plot and a summary table are provided in File S1, Figure S7.
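To make the two quantities in this section concrete, the following standard-library sketch computes a textbook tf-idf score and the within-cluster sum of squares (WCSS) for a given cluster assignment. It is an illustration only and not the implementation used here: scikit-learn's TfidfVectorizer applies a smoothed idf and L2 normalization by default, so its scores differ numerically, and its KMeans estimator reports the WCSS as the inertia_ attribute. The function names and toy documents are assumptions.

```python
from __future__ import annotations

import math
from collections import Counter, defaultdict


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {word: tf-idf} mapping per tokenized document.

    tf  = count of the word in the document / document length
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
             for w, c in counts.items()}
        )
    return vectors


def wcss(points: list[list[float]], labels: list[int]) -> float:
    """Sum of squared distances from each point to its cluster centroid."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    total = 0.0
    for members in groups.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            (x - c) ** 2 for point in members for x, c in zip(point, centroid)
        )
    return total


# Toy corpus: "protein" occurs in every document, "milk" only in the first.
docs = [
    "milk protein milk proteome".split(),
    "wheat protein gluten".split(),
    "plasma protein diet".split(),
]
vectors = tfidf(docs)
top_word = max(vectors[0], key=vectors[0].get)

# Toy 2-D points: splitting off the distant point lowers the WCSS.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]
one_cluster = wcss(points, [0, 0, 0])
two_clusters = wcss(points, [0, 0, 1])
print(top_word, one_cluster > two_clusters)
```

In the toy corpus, "milk" receives the highest score in the first document because it is frequent there and absent elsewhere, whereas "protein" scores zero because it appears in every document. Plotting the WCSS against increasing K and looking for the point where the decrease levels off is the usual way such a plot is used to choose K.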

Disease Annotation
The Database for Annotation, Visualization, and Integrated Discovery (DAVID; https://david.ncifcrf.gov/list.jsp (accessed on 20 June 2020)) was used for disease annotation.

Document Annotation
Annotation of the proteomic-nutrition corpus by DNorm [14] and GNormPlus [15] identified the diseases and genes/proteins, respectively, most often mentioned in the nutrition-proteomic corpus (Figure 2A,B, Supplement SA, Section 4). Importantly, these algorithms perform both tagging and normalization (i.e., collapsing of potentially long lists of tagged synonyms to a common label) of terms in the corpus. GNormPlus does not distinguish between genes and proteins in the document annotation.

Comention Analysis
Whereas Figure 2 provides an overview of the diseases and proteins studied in the nutrition-proteomic corpus, comention analysis identifies a link between two entities, in this case sentence-level comentions between proteins and diseases, which can be displayed as a network (Figure 3A) or in tabular form (Figure 3B). The nutrition-proteomic corpus contains 5373 comentions between proteins and diseases (Table S5.2). Comentions have no direction or specific semantic meaning: the mentioned protein could affect the disease or, conversely, the disease could lead to over-expression of the protein, to give two simple examples of the many types of possible semantic relationships.

Document Clustering
An important step in many NLP analyses is the conversion of individual words and/or documents to numeric vectors (also known as vectorization), which allows for mathematical analysis of the corpus by a variety of methods. For this analysis, we used the tf-idf metric as a document-level vectorization and used K-means analysis to identify seven clusters of these document vectors with increased similarity in keyword content within each cluster. Rather than using MeSH terms to group or categorize the documents in our corpus, tf-idf was used because it is a purely data-driven and discovery-oriented approach that does not rely on a pre-defined set of categories that may or may not adequately describe the corpus. From the document clusters, we calculated the average tf-idf score per word within each cluster to identify the top 10 terms per group (Table 2). This "theme" variable describes the main topic of each cluster and provides a convenient filter for identifying publications of interest (Figure 4; the interactive version is in Supplement SA, Figure S6.3). The trends in the topic areas by year of publication are shown in Figure 5 (the interactive version is in SA, Figure 2).
Table S5.2 lists the PubMed ID (pmid), the extracted sentence, publication date, the gene/protein, and the disease. PMIDs retain links to the PubMed abstracts. Individual columns can be sorted (arrows in column headings) and searched.
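The ranking of terms within each cluster (average tf-idf score per word, as used for Table 2) can be sketched as follows. The function name top_terms, the cluster labels, and the toy scores are illustrative assumptions; a word that is absent from a document is treated as having a score of zero in that document.

```python
from __future__ import annotations

from collections import defaultdict


def top_terms(
    vectors: list[dict[str, float]], labels: list[str], n: int = 10
) -> dict[str, list[str]]:
    """Rank words in each cluster by their mean tf-idf over the cluster's documents."""
    sums: dict[str, dict[str, float]] = defaultdict(lambda: defaultdict(float))
    sizes: dict[str, int] = defaultdict(int)
    for vec, label in zip(vectors, labels):
        sizes[label] += 1
        cluster = sums[label]  # created even if the document vector is empty
        for word, score in vec.items():
            cluster[word] += score
    ranked = {}
    for label, word_sums in sums.items():
        # Dividing by cluster size counts missing words as zero.
        means = {w: s / sizes[label] for w, s in word_sums.items()}
        # Highest mean first; ties broken alphabetically for reproducibility.
        ranked[label] = sorted(means, key=lambda w: (-means[w], w))[:n]
    return ranked


# Toy tf-idf vectors (invented scores) for three documents in two clusters.
vectors = [
    {"milk": 0.6, "infant": 0.2},
    {"milk": 0.4, "formula": 0.3},
    {"wheat": 0.5, "gluten": 0.5},
]
labels = ["C", "C", "D"]
themes = top_terms(vectors, labels, n=2)
print(themes)
```

The resulting list of top terms per cluster plays the role of the "theme" label and can be used as a filter when browsing publications.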

Simple Quantitative Characteristics of the Thematic Clusters
The publications in the clusters can be further analyzed for a variety of different "thematic characteristics." Examples include the occurrence and statistical overrepresentation of individual proteins (Table 3), disease associations (Table 4), and functional analysis using DAVID [18], STRING [19], or KEGG [20]. The same proteins can be found in different clusters (see Supplement File SB for full list). The proteins in each cluster are (of course) related to and help drive the thematic content.
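The statistical test behind the overrepresentation results is not specified in this section. One common choice for this kind of question is a one-sided hypergeometric test, sketched below purely as an assumption; the function name and the small numeric example are illustrative and are not taken from Table 3.

```python
from math import comb


def hypergeom_sf(k: int, total: int, with_protein: int, cluster_size: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    total         number of documents in the corpus
    with_protein  number of corpus documents mentioning the protein
    cluster_size  number of documents in the cluster of interest
    k             number of cluster documents mentioning the protein
    """
    upper = min(with_protein, cluster_size)
    favourable = sum(
        comb(with_protein, i) * comb(total - with_protein, cluster_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(total, cluster_size)


# Tiny worked example: 10 documents, 3 mention the protein, and the cluster
# holds 3 documents, all of which mention it. Only 1 of the C(10, 3) = 120
# possible clusters of that size has this property.
p_value = hypergeom_sf(k=3, total=10, with_protein=3, cluster_size=3)
print(p_value)
```

A small probability indicates that the protein is mentioned in the cluster more often than expected if documents had been assigned to clusters at random; when many proteins are tested, a multiple-testing correction would normally follow.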

Disease Annotation
The DAVID Functional Annotation Tool [18] was used to analyze disease associations of the proteins in each of the thematic clusters to provide more context on the keyword results used to define the clusters (Table 4). In general, the diseases found were consistent with the top tf-idf keywords in each cluster. As expected, the larger the cluster, the more diseases are associated with the proteins. Cluster C is an exception because many of the publications in this theme are related to milk production in agricultural animals and humans.

From Group Level Data to Individual Papers
Analyses of all articles in the nutrition-proteomic corpus (Figures 1-4 and Tables 1 and 2) or characterizations of the publications grouped by the tf-idf cluster analysis (Tables 3 and 4) provide a metaphorical ~30,000 ft or ~1000 ft view of the extracted articles. Individual abstracts or publications can also be viewed and further analyzed by the manual text mining typical for preparing publications, systematic reviews, and meta-analyses. For illustrative purposes, Table 5 compares the top words of the thematic cluster with the top words in the document for all the publications described in the following section. By definition, the top cluster words will also show more frequent use within most or all documents in the cluster. However, it is not necessarily the case that top cluster words should also be top document words for each document in the cluster. Selected articles from each cluster are discussed to show the utility of the pipeline and subsequent data-driven cluster analysis. We did not focus on a particular topic for each cluster, but selected articles based on current interest to clinical and research nutritionists interested in proteomic analysis of health, disease, and treatments. Nutritional research is increasingly focused on measuring the effect of dietary patterns on health and disease processes rather than how specific foods or isolated nutrients alter physiology.
Proteomic profiles of 1713 participants of the Framingham Heart Study were analyzed in relation to three dietary patterns: (i) the Alternative Healthy Eating Index (AHEI), (ii) the Dietary Approaches to Stop Hypertension (DASH) diet, and (iii) the Mediterranean-style Diet Score (MDS) [21]. DNA-based aptamers (SOMAscan) were used to find unique associations between plasma proteins and dietary patterns: 17 proteins with AHEI, 52 with DASH, and 3 with MDS. The significant proteins were enriched in biological pathways involved in cellular metabolism/proliferation and immune response/inflammation, providing insights into the molecular mechanisms mediating diet-related disease. Although the specific proteins may become biomarkers of these three dietary patterns, the results need replication in independent populations.
Table 3. Some abbreviated terms were left unresolved to provide examples of the search results. In some cases, these terms are filtered for subsequent analysis. References for each paper in a cluster can be obtained by clicking on a dot in Figure S4 of Supplementary file SA.
A separate study examined the AHEI and modified versions of the Mediterranean-style Diet Score (mMDS) and DASH (mDASH) in 6360 participants (mean age 50 years; 54% women) in the same Framingham Heart Study [22]. The proteomic analysis used a modified sandwich ELISA method multiplexed on a Luminex xMAP (Sigma-Aldrich, St. Louis, MO, USA). The associations between diet and 71 candidate cardiovascular disease (CVD)-related proteins were examined against the three diet quality scores. Mediation analysis identified proteins that mediated the associations between diet and incident CVD and all-cause mortality. The results indicated that a healthy diet is associated with circulating CVD-related protein biomarkers, largely representing regulators of inflammatory pathways, in this group of middle-aged and older participants. Four proteins, B2M (beta-2-microglobulin), GDF15 (growth differentiation factor 15), sICAM1 (soluble inter-cellular adhesion molecule 1), and UCMGP (uncarboxylated matrix Gla-protein), may mediate the association of diet with health outcomes.
The CORonary Diet Intervention with olive oil and cardiovascular PREVention (CORDIOPREV) study [24] evaluated the effect of two healthy dietary models (a Mediterranean diet and a low-fat diet) on endothelial function, measured by flow-mediated dilation (FMD), in patients with coronary heart disease (CHD). Patients with CHD following the Mediterranean diet had higher FMD compared with those on a low-fat diet, regardless of the severity of endothelial dysfunction. The Mediterranean diet also led to better endothelial function, enhanced endothelial repair mechanisms, and a reduction in the mechanisms associated with endothelial damage. Patients who consumed the Mediterranean diet had lower miR181c-5p levels than those on a low-fat diet, inhibiting the proapoptotic action of this miRNA. miR181c-5p is also implicated in ROS synthesis, so the reduction in the levels of this miRNA would lead to a decrease in GPx3, an enzyme that catalyzes the reduction of ROS, as was observed in another proteomic study [36]. The Mediterranean diet may better modulate endothelial function compared with a low-fat diet and is associated with a better balance of vascular homeostasis in CHD patients, even in those with severe endothelial dysfunction.
Proteins from liver tissue samples from controls and patients with parenteral nutrition (PN)-associated liver disease (PNALD) were analyzed with the Isobaric Tag for Relative and Absolute Quantitation (iTRAQ)-based quantitative proteomics method [23]. A total of 112 proteins were found to be differentially expressed, of which 73 were downregulated and 39 were upregulated in tissues from the PNALD group. These proteins were associated with mitochondrial oxidative phosphorylation, hepatic glycolipid metabolism (primarily involved in glycogen formation and gluconeogenesis), and oxidative stress (involving changes in antioxidants). For example, CYP2B6, DDAH1, and NDUFA1 were significantly downregulated, and FABP5 and CAPG were upregulated, in the PNALD group compared to the control group. This study identified candidate proteins as future PNALD biomarkers or therapeutic targets related to long-term parenteral nutrition therapy.
A follow-on study of the DioGENES weight maintenance study [37] analyzed 173 glycemic responders versus 201 glycemic nonresponders who had previously lost >8% of their body weight on an 800 kcal/d low-calorie diet (LCD) for 8 weeks. The two groups were comparable at baseline for body composition, glycemic control, adipose tissue transcriptomics, and plasma ketone bodies, but they differed significantly in their response to the LCD, including improvements in visceral fat, overall insulin resistance (IR), and tissue-specific IR. Proteomics analysis was conducted using DNA aptamers (version 1 of Somascan, SomaLogic, Boulder, CO, USA), which revealed that total ApoE, ApoE2, ApoE3, and ApoE4 differed significantly between responders and nonresponders. In addition, proteomic pathway analyses highlighted other proteins involved in lipoprotein metabolism (APOA1, BMP1, FABP3, and ANGPTL4) that are key markers in obesity and NAFLD (e.g., [38,39]). Given the link between NAFLD, insulin resistance, and lipid metabolism, these markers, as well as other adipo- and hepatokines, deserve further investigation in the context of weight loss and glycemic improvements following LCD intervention.
[26]), several attempts have been made to supplement infant formulas (IFs) with polar lipid and protein fractions of bovine. The most shared proteins across species were involved in protein/vesicle-mediated transport, along with major MFGM proteins such as BTN, ADPH, FABP, and MUC1. The main difference regarding the human MFGM proteome was a higher enrichment in enzymes involved in lipid catabolism [27] and in a set of immune response proteins [28]. The similarities between the human and cow MFGM proteomes and molecular functions suggest that bovine milk, and more specifically bovine MFGM proteins, could be used as a supplement in infant formulas (rev. in [26]).
Phosphorylation is a widespread posttranslational protein modification involved in the regulation of many biological processes. Liquid chromatography mass spectroscopy (LC-MS/MS) was used to quantitatively analyze phosphorylation sites in human colostrum and mature milk fat globule membrane (MFGM) proteins [29]. A total of 71 phosphorylation sites in 48 human MFGM proteins differed between stages. These 48 phosphoproteins were mainly associated with immune-related processes. The majority of these phosphoproteins were involved in 16 KEGG pathways, including insulin, AMPK, Ras, PI3K-Akt, ErbB, apelin, and chemokine signaling pathways. STRING functional protein association network analysis suggested more immune system process-related phosphoproteins in human colostrum MFGM than in mature MFGM, probably because of the important role that colostrum has in building the immune system of newborns [29].
Different storage conditions, and particularly temperature variation, may affect the stability of the human milk proteome. More specifically, 22 of 110 quantifiable proteins significantly decreased after 48 h at 4.9 and 6.0 °C (rev. in [30]). Pasteurization of human milk changes milk components such as immunoglobulins and lactoferrin (rev. in [30]). Additional variability in milk composition may be caused by environmental factors. For example, children who grow up on farms are at lower risk of developing childhood atopic disease, which is called the "farm effect" [40]. A quantitative glycoproteomics analysis of 54 milk samples from Rochester urban/suburban and Old Order Mennonite mothers identified differences in 79 N-glycopeptides from 15 different proteins, including many involved in immune function [31]. These findings highlight the importance of understanding and recording metadata such as storage conditions, processing, handling, and origin of samples not only for proteome research, but also for the nutritional content of milk products.
The consumption of bread wheat (Triticum aestivum ssp. aestivum) products can cause celiac disease (CD), allergic reactions, and nonceliac wheat sensitivity (NCWS) (rev. in [32]). The proteomes of flour from 15 representative varieties of bread wheat and from a spelt subspecies were analyzed using nano LC-ESI-MS/MS. Although 81 proteins (e.g., alpha-amylase inhibitors, serpins, gliadins, glutenins, and Bowman-Birk trypsin inhibitor) had high heritability, protein expression was largely driven by environmental effects. Of the 3050 proteins expressed in spelt, 1555 proteins were differentially expressed depending on field locations. Similarly, 1166 of 2770 proteins in bread wheat showed differential expression, underlining the large environmental impact caused by type of nitrogen fertilization, weather conditions, and soil types (rev. in [32]).
Traditional food plants (TFPs) typically consumed by rural indigenous communities across the globe have been gaining increasing importance in maintaining genetic diversity as climate changes alter local environments (rev. in [33]). Proteomic analysis of samples of Manihot esculenta (cassava) obtained during cold or drought stress identified changes in levels of proteins such as ATP synthase subunit beta, rubisco activase (RCA), rubisco, phosphoglycerate, chaperone peroxiredoxin, heat shock protein, glutathione transferase, and the drought-induced Di19-like protein (rev. in [33]). In addition to improving sustainable growth characteristics, proteomic analysis can also identify processes for improving the nutritional content of plants. For example, the expression of the phytoene synthase (EutPSY) gene was found to be correlated with the higher accumulation of lycopene in the silverberry (E. umbellata) (rev. in [33]). Understanding the proteomic changes caused by climate stress or the pathways responsible for levels of nutritionally important metabolites may allow for rapidly improving resilience and nutritional content in these and other plants using modern genetic modification procedures (e.g., CRISPR).
No accepted diagnostic markers are known for the three subtypes (i.e., diarrheal, constipation, or mixed) of irritable bowel syndrome (IBS). Tandem mass tag (TMT)-based proteomics was performed on intestinal mucosa samples obtained during colonoscopies of IBS-diarrheal (IBS-D) patients and controls [34]. Eighty differentially expressed proteins were identified, of which 48 were up-regulated and 32 were down-regulated by a factor of 1.2 or greater (p < 0.05).
The identified proteins were significantly enriched in the nutrient ingestion pathways related to immune molecules and to FODMAP (fermentable oligosaccharides, disaccharides, monosaccharides, and polyols) metabolism, including methanethiol oxidase (SELENBP), V-set and immunoglobulin domain containing 2 (VSIG2), 7-dehydrocholesterol reductase (DHCR7), B cell receptor associated protein 31 (BCAP31), peroxisome proliferator activated receptor alpha (PPARA) and delta (PPARD), galactose mutarotase (GALM), carbonic anhydrase 2 (CA2), immunoglobulin kappa variable 1-33 (IGKV1-33), and fatty acid binding protein (FABP1), among others. These results confirmed changes in immune function pathways but also identified potential new explanations for why low FODMAP diets are not tolerated by IBS-D patients.

Cluster G: Cell, Protein, Cancer, Meat, Colorectal, Quality, Proteomics, Fish, Study, Muscle
Intermittent fasting has been linked to improved health outcomes, but the molecular mechanisms are unknown. A mass spectroscopic proteomic analysis was conducted of serum samples from 14 healthy subjects who fasted from dawn to sunset (i.e., greater than 14 h daily) for 30 consecutive days but otherwise were not calorically restricted [35]. The untargeted serum proteomic profiling identified upregulation of key regulatory proteins involved in glucose and lipid metabolism, circadian clock, DNA repair, cytoskeleton remodeling, immune system, and cognitive function. The identified proteins are associated with protective effects for cancer, metabolic syndrome, inflammation, Alzheimer's disease, and several neuropsychiatric disorders. For example, a 9-fold increase was found for the large tumor suppressor kinase 1 (LATS1) and an 11-fold increase for nuclear receptor subfamily 1 group D member 1 (NR1D1), with a significant reduction in the amyloid beta precursor protein (APP), beta-1,4-galactosyltransferase 1 (B4GALT1), and ArfGAP with SH3 domain, ankyrin repeat, and PH domain 1 (ASAP1) at the end compared to before the 30-day intermittent fasting period.

Data-Driven Analysis and Organization of Literature
NLP methods have been rapidly adopted for the biomedical literature and, increasingly, the Electronic Health Record space (e.g., [41]), particularly since the development of the widely impactful transformer-based deep learning models [42]. Applications of NLP for information extraction from biomedical/clinical text include tagging [43], normalization and linking of biomedical entities [14], relation extraction [44], and event extraction [45]. To our knowledge, NLP has not been used for preparing narrative or systematic reviews, although several reports compared ML approaches to guide initial selection of the literature corpus rather than using ML for downstream information extraction from the retrieved documents [46-48]. A PubMed knowledge graph connected (1) authors, their educational background, funding data, and affiliation history with (2) the diseases, drugs, genes, species, and mutations identified in their corpus of abstracts [49]. An alternative approach to literature retrieval, which we have applied in the domain of whole-food and nutrient effects on human disease [11], is to individually index all foods (to the extent reasonably possible) and nutrients in all biomedical abstracts and full text, and retrieve all articles referencing any food or nutrient.
Identification of the nutrition-proteomic corpus presented here is naturally specific to the exact search terms described in Table 1 and, therefore, some proteomic studies in PubMed/PMC will not be identified by the pipeline if they do not match those search terms. As examples, our team previously used samples from a micronutrient intervention in healthy children aged from 9 to 13 years old [6] for three serum proteomic studies. One study explored the relationship of 117 pro-inflammatory proteins with different levels of DNA damage and EPA and DHA levels [50]. A second study explained levels of serum LTA4H with plasma levels of cobalamin, riboflavin, pyridoxal, and homocysteine [51]. The third study identified 20 differentially expressed proteins (with iTRAQ technology) between 10 individuals with lower triglycerides, LDL, and VLDL compared to 10 individuals with higher lipid profiles [52]. Although these reports described analysis of nutritional biomarkers in serum (lipid profile, vitamin levels, and hundreds of other metabolites [6]), none of these papers appeared in the nutrition-proteomic corpus because they did not include the terms "nutrition" or "nutritional" (or the corresponding MeSH term) in the abstract and publication. Hence, they did not match our queries. Different search terms such as nutritional or serum biomarkers, blood metabolite levels, or lipid profiles and proteomics would identify those documents focused on those topics. However, for this review, we aimed to maintain a focus on articles with explicit reference to the field of proteomics in nutrition, diet, and food research.
Our analyses did not assess the PICO elements [53] for each study in the corpus, namely the populations (e.g., description of participants in study), intervention (e.g., diets, drugs, or other), comparator (e.g., diseased versus control, or high fat group versus low fat group), or outcome (e.g., decreased biomarker or disease symptom), which was beyond the scope of this review.
We used K-means cluster analysis to identify seven "themes" in the extracted nutriproteomic corpus. Others have reported methodological studies that compared various analytical methods (i.e., tf-idf, latent semantic analysis, topic modeling, self-organizing maps, and Poisson-based [46] or Gaussian mixture models [54]) to identify thematic clusters of documents extracted using MeSH terms [46] or the BioBERT language model [54]. Regardless of the methodological approach, cluster analysis further contextualizes documents identified by machine reading technologies.

Conclusions
The use of high-throughput quantification and analysis of proteins in response to nutrition, food, or diet differences has been increasing rapidly with the development of improved mass spectroscopic sample workflows (e.g., [8]), molecularly tagged antibody technology (Olink [7]), and aptamer technologies [10]. The machine reading pipeline used here identified 945 publications on proteomics in the domain of nutrition, food, and diet. Reviewing this number of publications requires a selection method and criteria that are, ideally, unbiased. Rather than sorting this corpus manually, we grouped publications into thematic clusters based on their word usage (i.e., tf-idf). Each cluster of publications could be the basis of a narrative, systematic, or meta-analytic review.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/nu15020270/s1, Supplement A: (Supp_A_Monteiro_Morine) and Supplement B: (Supp_B_Monteiro_Morine_Disease_Association). Supplement A is an html file that can be opened by any browser. The file is interactive in that tables contain multiple tabs and can be sorted and searched. Data in figures are linked to PubMed by hovering over the bars or dot plots. Supplement B is the output of DAVID functional analysis for disease association of proteins in each cluster.