Quantitative Metabolomic Dataset of Avian Eye Lenses

Abstract: Metabolomics is a powerful set of methods that use analytical techniques to identify and quantify metabolites in biological samples, providing a snapshot of the metabolic state of a biological system. In medicine, metabolomics may help to reveal the molecular basis of a disease, make a diagnosis, and monitor treatment responses, while in agriculture, it can improve crop yields and plant breeding. However, animal metabolomics faces several challenges due to the complexity and diversity of animal metabolomes, the lack of standardized protocols, and the difficulty in interpreting metabolomic data. The current dataset includes quantitative metabolomic profiles of eye lenses from 26 bird species (111 specimens) that can aid researchers in developing new experiments and mathematical models and in integrating with other “-omics” data. The dataset includes raw ¹H NMR spectra and protocols for sample preparation and data preprocessing, with the final table containing information on the abundance of 89 reliably identified and quantified metabolites. The dataset is quantitative, which makes it suitable for supplementing with new specimens or comparison groups, followed by data mining and new interpretations. The data were obtained using bird specimens collected in compliance with ethical standards and revealed potential differences in metabolic pathways due to phylogenetic differences or environmental exposure.


Data 2023, 8, 125

Summary
Metabolomics is a field of study that aims to identify and quantify the small molecules, or metabolites, present in biological samples, such as cells, tissues, or fluids. These molecules are initial, intermediate, and end products of cellular processes that provide a snapshot of the metabolic state of a biological system. Metabolomics uses many analytical techniques, such as liquid chromatography-mass spectrometry (LC-MS) and nuclear magnetic resonance (NMR) spectroscopy, to measure and identify metabolites and generate metabolomic profiles. In medicine, metabolomics can be used to identify disease biomarkers, reveal the molecular basis of a disease, and monitor treatment responses. In agriculture, metabolomics can help to improve crop yields and to identify markers for plant breeding. In environmental science, metabolomics can be used to assess the impact of pollutants on ecosystems and monitor the health of natural systems. Overall, metabolomics provides a powerful tool for understanding biological systems and for developing new diagnostic, therapeutic, and environmental solutions.
Metabolomics of animals is a rapidly developing branch of "-omics" science, but it still faces several challenges that hinder its progress. One of the main difficulties is the complexity and diversity of animal metabolomes. Different animal species have distinct metabolomes, and even conspecific individuals can exhibit variations in their metabolomic profiles. This heterogeneity requires a large sample size to draw meaningful conclusions, making the analysis time-consuming and costly. Additionally, many metabolites in animal samples are present in low abundance, making their detection and quantification challenging. The lack of standardized sample preparation and analysis protocols also presents a challenge, making it difficult to compare results across different studies. Furthermore, the interpretation of metabolomic data can be complex due to the interactions between various metabolic pathways and the effects of external factors such as diet, environment, and genetics. These challenges have limited the widespread adoption of animal metabolomics and highlighted the need for continued development of new analytical techniques, standardization of protocols, and integration with other -omics disciplines. In addition, there is an insufficient number of publicly available datasets in the field of animal metabolomics, making it difficult for researchers to build on previous work.
The current dataset is a set of quantitative metabolomic profiles of lens tissues from 26 bird species from various orders. The main method of identifying and quantifying metabolites in tissue was NMR spectroscopy (LC-MS was used for complicated identification cases and cross-identification). The dataset includes not only the final table with metabolite concentrations but also all the raw NMR data and detailed protocols for sample preparation and data preprocessing. A part of this dataset formed the basis of a pilot study on the use of quantitative metabolomic profiling for taxonomic differentiation of species [1]. Also, analysis of a subset of this dataset made it possible to discover a new molecular ultraviolet filter in the bird's eye lens, the reduced form of nicotinamide adenine dinucleotide (NADH) [2]. However, we are confident that the potential for using these data is much wider, and making them available to the public will allow scientists to use them in their work. Metabolomic studies are most often represented by semi-quantitative LC-MS measurements, which are very protocol- and instrument-dependent and can hardly be re-used in other analyses. The data presented in this work are quantitative, which makes them suitable for supplementing with new samples or comparison groups or as a basis for completely new experiments. The data can also be integrated with other omics data, such as genomics, transcriptomics, and proteomics, to provide a more comprehensive understanding of biological processes and systems. Additionally, quantitative metabolomic data can be used to develop mathematical models of metabolic systems, which can aid in predicting metabolic responses to perturbations and designing new interventions.

Data Description
The dataset is represented by two main parts:

1. Raw data: ¹H NMR spectra (a description of the acquisition protocol is provided below in the methods section);
2. Quantitative metabolomic data presented in a table (csv format), where the columns correspond to samples and the rows correspond to metabolites. Concentrations in the table are given in nmol per gram of wet tissue weight.
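The table layout described above (metabolites in rows, samples in columns) can be read with a few lines of standard-library Python. The following is only a minimal sketch: the sample names, the header label, the numbers, and the treatment of empty cells as missing values are all invented for illustration and are not taken from the published file.

```python
import csv
import io

# Hypothetical miniature of the concentration table. Rows are metabolites,
# columns are samples, values are nmol per gram of wet tissue.
# All names and numbers below are invented for illustration.
csv_text = """metabolite,sample_A,sample_B,sample_C
Taurine,1200.0,950.0,1430.0
myo-Inositol,8100.0,7600.0,
GSH,2300.0,2100.0,2550.0
"""

def load_concentrations(handle):
    """Return {sample: {metabolite: concentration or None}}.

    The input is a metabolites-by-samples CSV; the result is keyed by
    sample, which is the orientation most analysis code expects.
    """
    reader = csv.reader(handle)
    header = next(reader)
    sample_names = header[1:]
    by_sample = {name: {} for name in sample_names}
    for row in reader:
        if not row:
            continue  # skip blank lines
        metabolite, values = row[0], row[1:]
        for name, value in zip(sample_names, values):
            # Assumption: an empty cell means the metabolite was not quantified.
            by_sample[name][metabolite] = float(value) if value.strip() else None
    return by_sample

profiles = load_concentrations(io.StringIO(csv_text))
for sample, metabolites in profiles.items():
    print(sample, metabolites)
```

For the real table, the `io.StringIO` object would be replaced by an open file handle to the downloaded csv file.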
The research was carried out in line with the ARVO Statement for the Use of Animals in Ophthalmic and Vision Research and the European Union Directive 2010/63/EU on the protection of animals used for scientific purposes. Additionally, the study was granted ethical approval by the International Tomography Center SB RAS (ECITC-2017-02).
The bird specimens were obtained from three different sources (Table 1): (1) during hunting season with official authorization from regional Ministries of Ecology and Natural Resources (Dagestan Republic; Altay Republic; Omsk Region; Tyva Republic; Republic of Sakha; Novosibirsk Region, Russia) for the purpose of collecting biological material as part of the annual program for studying infectious diseases in wild animals, which was approved by the Biomedical Ethics Committee of FRC FTM, Novosibirsk, Russia (Protocols No. 2013-23 and 2021-10); (2) provided by the Center for the Rehabilitation of Wild Animals (CRWA, Novosibirsk, Russia) following humane euthanasia of birds that were mortally wounded; and (3) obtained through a special permit for scientific purposes from the Committee for the Protection of the World's Wild Animals of the Republic of Altay, Russia (#5, 21 August 2018). As an example of NMR spectrum identification, Figure 1 shows the ¹H NMR spectrum with signal annotations obtained for the rook (Corvus frugilegus) lens.
Figure 1. Representative ¹H NMR spectrum of the bird lens metabolome, obtained for the rook (Corvus frugilegus) lens. Abbreviations: 3-OH-i-Val, 3-hydroxyisovalerate; ADP, adenosine diphosphate; ATP, adenosine triphosphate; CHCl₃, chloroform; Erg, ergothioneine; EtOH, ethanol; Gl-phChol, glycero-3-phosphocholine; GSH, reduced glutathione; GTP, guanosine triphosphate; MeOH, methanol; myo-In, myo-inositol; NAD, nicotinamide adenine dinucleotide; pGlu, pyroglutamate; Tau, taurine. For amino acids, the standard three-letter code is used.
As an example of a possible analysis of the published dataset and its visualization, we provide the results of several algorithms for dimension reduction. The most commonly used data visualization method in metabolomics is principal component analysis (PCA). Principal component analysis is a statistical technique used to reduce the complexity of a dataset by representing its important underlying variables through a smaller set of linearly uncorrelated variables, known as principal components. The advantage of PCA is that the contributions of the original variables to the new components are easy to interpret, but it is poorly suited to analyzing a large number of sample groups. Figure 2 (bottom right) shows that the large differences between the pale sand martin, Anatidae, and Passeridae do not allow other species to form separate clusters. Two other algorithms provide much better results for separating species groups (Figure 2, top and bottom left). UMAP (uniform manifold approximation and projection) [3] and t-SNE (t-distributed stochastic neighbor embedding) are non-linear dimension reduction algorithms used for visualizing high-dimensional data. The main difference between them is that UMAP considers the global structure of the data and preserves more accurate distances between neighboring objects, while t-SNE tends to optimize locally and preserve higher object densities in the resulting clusters. UMAP's advantage is a faster runtime and better scalability, while t-SNE usually provides more intuitive visualizations for small datasets with many clusters. Figure 2 also demonstrates that for metabolomics data, UMAP better reflects the genetic relationship of species, i.e., better preserves the global structure, while t-SNE forms clusters by species that fill the embedding space homogeneously.
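The PCA step described above can be sketched with NumPy alone. This is an illustrative sketch, not the procedure used to produce Figure 2: the log transform and autoscaling are assumed preprocessing choices that the text does not specify, and the input matrix is synthetic. UMAP and t-SNE need third-party packages (for example umap-learn and scikit-learn) and are not shown here.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project samples (rows of X) onto the first principal components.

    Returns the scores, the loadings (contribution of each original
    variable to each component), and the explained variance fractions.
    """
    # Assumed preprocessing: log-transform, then autoscale each metabolite.
    Z = np.log1p(X)
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)
    # SVD of the centered matrix gives the principal axes in the rows of Vt.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, Vt[:n_components], explained[:n_components]

# Synthetic stand-in for a samples-by-metabolites matrix (nmol/g):
# 12 samples, 5 metabolites, two groups with different overall levels.
rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=5.0, sigma=0.2, size=(6, 5))
group_b = rng.lognormal(mean=6.0, sigma=0.2, size=(6, 5))
X = np.vstack([group_a, group_b])

scores, loadings, explained = pca_scores(X)
print("explained variance fractions:", explained)
print("PC1 loadings per metabolite:", loadings[0])
```

The loadings are what makes PCA easy to interpret: each row shows how strongly every original metabolite contributes to the corresponding component.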

Sample Preparation
The process of preparing the lens samples was carried out following the detailed procedure described in [4]. The analyzed species are rather widespread in Siberia, and obtaining the samples was relatively straightforward. All individuals used in the study were adult wild-caught specimens collected between 2017 and 2022. After extraction from the eye, the lenses were cleaned, placed in individual cryotubes, frozen in liquid nitrogen, and stored at −70 °C until analyzed. For each sample, we analyzed one, two, or three lenses from different individuals depending on the lens size. Prior to homogenization, each sample was weighed. The lenses were homogenized in glass vials using a TissueRuptor II rotor-stator homogenizer (Qiagen, Venlo, The Netherlands) with 1600 µL of cold MeOH (−20 °C), followed by the addition of 800 µL of water and 1600 µL of cold chloroform. The mixture was shaken for 20 min in a shaker and then left at −20 °C for 30 min. After that, the mixture was centrifuged at 16,100 × g and +4 °C for 30 min, which resulted in two immiscible liquid layers separated by a lipid-protein layer. The upper aqueous layer (MeOH-H₂O) was collected, divided into two parts for NMR (2/3) and LC-MS (1/3) analyses, and vacuum-dried for further analysis.

NMR Measurements
For NMR measurements, the extracts were dissolved in 600 µL of D₂O containing 2 × 10⁻⁵ M of sodium 4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS) as an internal standard and 20 mM of deuterated phosphate buffer (pH 7.2). The ¹H NMR measurements were conducted at the "Mass Spectrometric Investigations" Center of Collective Use, SB RAS, using an AVANCE III HD 700 MHz NMR spectrometer (Bruker BioSpin, Ettlingen, Germany). The NMR spectra for each sample in a standard 5 mm glass NMR tube were acquired using a 5 mm TXI ATMA NMR probe by summing 96 transients while maintaining the sample temperature at 25 °C and using a 90-degree detection pulse. To allow for the relaxation of all spins, a repetition time of 20 s was used between scans. Prior to acquisition, low-power radiation was applied at the water resonance frequency to presaturate the water signal.
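Two quantities follow directly from the numbers above and are useful when reprocessing the raw spectra: the amount of DSS in each NMR sample and the approximate acquisition time per spectrum. The short calculation below is a sketch; it ignores dummy scans and instrument overhead, so the time is a lower estimate.

```python
# Values stated in the acquisition protocol.
dss_molarity_mol_per_l = 2e-5   # DSS concentration in the D2O solution
solvent_volume_l = 600e-6       # 600 µL of D2O per sample
transients = 96                 # scans summed per spectrum
repetition_time_s = 20          # time between scans

# Amount of internal standard in the NMR tube, in nmol.
dss_nmol = dss_molarity_mol_per_l * solvent_volume_l * 1e9

# Approximate acquisition time per sample, in minutes.
acquisition_min = transients * repetition_time_s / 60

print(f"DSS per sample: {dss_nmol:.1f} nmol")
print(f"Acquisition time: {acquisition_min:.0f} min")
```

Under these values each tube contains about 12 nmol of DSS, and one spectrum takes roughly 32 min to acquire.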

LC-MS Measurements
The hydrophilic interaction liquid chromatography (HILIC) method was used for LC separation of the samples, utilizing a TSKgel Amide-80 HR column (4.6 × 250 mm, 5 µm) on an UltiMate 3000RS chromatograph (Dionex, Germering, Germany). A diode array UV-vis detector (DAD) with a 190-800 nm spectral range was used with a flow cell. The mobile phase consisted of solvent A, a 0.1% formic acid solution in H₂O, and solvent B, a 0.1% formic acid solution in acetonitrile. The column temperature was set to 40 °C, and the flow rate was 1 mL/min. The injection volume for the samples was 10 µL. The gradient was as follows: 95% solvent B from 0 to 5 min, 95-65% from 5 to 32 min, 65-35% from 32 to 40 min, 35% from 40 to 48 min, 35-95% from 48 to 50 min, and 95% from 50 to 60 min. A flow splitter (1:10) directed the lesser flow after the DAD cell to an ESI-q-TOF high-resolution hybrid mass spectrometer maXis 4G (Bruker Daltonics, Bremen, Germany). The mass spectra were recorded in positive mode with a range of 50-1000 m/z. The MS setup, calibration procedure, and data processing were previously described [5].
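The gradient program above can be written as a piecewise function of time, which is convenient when relating retention times to mobile phase composition. The sketch below assumes linear ramps between the stated breakpoints; the text gives only the start and end values of each segment.

```python
import numpy as np

# Breakpoints of the gradient as stated in the text:
# time in minutes and the percentage of solvent B at that time.
GRADIENT_MIN = np.array([0, 5, 32, 40, 48, 50, 60], dtype=float)
GRADIENT_PCT_B = np.array([95, 95, 65, 35, 35, 95, 95], dtype=float)

def percent_b(t_min):
    """Percentage of solvent B at time t_min (assumes linear ramps)."""
    return np.interp(t_min, GRADIENT_MIN, GRADIENT_PCT_B)

def percent_a(t_min):
    """Percentage of solvent A, the remainder of the binary mixture."""
    return 100.0 - percent_b(t_min)

for t in (0, 18.5, 36, 44, 49, 60):
    print(f"t = {t:5.1f} min: {percent_b(t):5.1f}% B, {percent_a(t):5.1f}% A")
```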

Metabolite Identification and Quantification
Metabolites were identified by comparing their NMR signals with reference spectra from the literature, databases (HMDB, METLIN, BMRB, and SpectraBase), and our in-house NMR library [5][6][7]. In instances where NMR signal assignment was not straightforward, we verified the identification of metabolites by spiking the lens extract with commercially available standard compounds. To identify unknown signals, we fractionated the metabolomic extract by HILIC and conducted MS, MS/MS, and NMR analysis on each fraction, as detailed in [5]. Despite these efforts, several signals in the NMR spectra remained unassigned. The metabolite concentrations in the lenses were determined in nmol/g by integrating NMR signals relative to the internal standard DSS and then normalizing to the wet weight of the tissue. The baseline processing, identification, and integration of spectral NMR peaks (quantification) were done using MestReNova v12.0 (Mestrelab Research, A Coruña, Spain). On average, 60-80 compounds were identified for each species. However, quantifying some compounds was unreliable due to weak signals or overlap with other signals. The final table contains only the 89 reliably identified and quantified metabolites. The raw data could contain more information on yet unidentified metabolites.
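The quantification step (integration relative to DSS, then normalization to wet weight) can be illustrated as follows. This is a sketch under stated assumptions, not the authors' processing script: the function name and arguments are hypothetical, the 12 nmol of DSS follows from 600 µL of a 2 × 10⁻⁵ M solution, the DSS reference singlet is taken to represent 9 protons, and the `extract_fraction` argument is included because the text does not say how the 2/3 split of the extract for NMR enters the calculation.

```python
def concentration_nmol_per_g(
    signal_integral,
    signal_protons,
    dss_integral,
    tissue_mass_g,
    dss_nmol=12.0,         # 600 µL x 2e-5 M, from the NMR protocol
    dss_protons=9,         # trimethylsilyl singlet of DSS
    extract_fraction=1.0,  # assumption: share of the extract in the NMR tube
):
    """Estimate a metabolite concentration in nmol per gram of wet tissue.

    Integrals are compared per proton, scaled by the known amount of DSS,
    corrected for the share of the extract measured, and divided by mass.
    """
    per_proton_metabolite = signal_integral / signal_protons
    per_proton_dss = dss_integral / dss_protons
    nmol_in_tube = per_proton_metabolite / per_proton_dss * dss_nmol
    return nmol_in_tube / extract_fraction / tissue_mass_g

# Invented example: a 2-proton signal with integral 4.0, a DSS integral
# of 9.0, and 50 mg of wet tissue.
c_uncorrected = concentration_nmol_per_g(4.0, 2, 9.0, 0.05)
c_corrected = concentration_nmol_per_g(4.0, 2, 9.0, 0.05, extract_fraction=2 / 3)
print(f"{c_uncorrected:.0f} nmol/g without split correction")
print(f"{c_corrected:.0f} nmol/g assuming 2/3 of the extract was measured")
```

Keeping the extract fraction as an explicit argument makes the assumption visible, so it can be set once the exact convention used for the published table is confirmed.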

Informed Consent Statement: Not applicable.
Data Availability Statement: Raw NMR spectra, the descriptions of specimens and samples, metabolite concentrations, and the preliminary metabolomic analysis are available at the MetaboLights repository, study identifier MTBLS7739 (https://www.ebi.ac.uk/metabolights/MTBLS7739 (accessed on 30 June 2023)).