MSProfileR: An Open-Source Software for Quality Control of Matrix-Assisted Laser Desorption Ionization–Time of Flight Spectra

In the early 2000s, matrix-assisted laser desorption ionization–time of flight mass spectrometry (MALDI-TOF MS) emerged as an efficient and relevant tool for identifying micro-organisms. Since then, it has become practically essential for identifying bacteria in microbiological diagnostic laboratories. In the last decade, it has been successfully applied to arthropod identification, allowing researchers to distinguish vectors from non-vectors of infectious diseases. However, identification failures are not rare, which hampers its wider use. Failure is generally attributed either to the absence of reference MS spectra for the corresponding species in the database or to the insufficient quality of query MS spectra (i.e., lower intensity and diversity of detected MS peaks). To avoid matching errors due to non-compliant spectra, a strategy for detecting and excluding outlier MS profiles became necessary. To this end, we created MSProfileR, an R package that provides, through a simple installation, a bioinformatics tool integrating a quality control system for MS spectra and an analysis pipeline including peak detection and MS spectra comparisons. MSProfileR can also add metadata concerning the sample from which the spectra are derived. MSProfileR has been developed in the R environment and offers a user-friendly web interface using the R Shiny framework. It is available on Microsoft Windows as a web browser application, accessed through the link of the package on GitHub v.3.10.0. MSProfileR is therefore accessible to non-computer specialists and is freely available to the scientific community. We evaluated MSProfileR using two datasets composed exclusively of MS spectra from arthropods. In addition to coherent sample classification, outlier MS spectra were detected in each dataset, confirming the value of MSProfileR.


Introduction
Matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS), which was largely employed for protein identification, was later applied to the rapid identification of whole bacteria based on the comparison of spectral patterns [1,2]. The principle of MALDI-TOF MS profiling is to match MS spectra, resulting from laser ionization of the sample, against a library of reference MS spectra for microbial identification [3,4]. The cost-effectiveness and rapidity of the MALDI-TOF MS approach have revolutionized the routine diagnostic identification of bacteria using simple and reliable procedures, replacing Gram staining and traditional biochemical methods. It has since been widely introduced in clinical laboratories for the identification of micro-organisms including bacteria, fungi and yeasts [5,6].
In the mid-2010s, this innovative tool was successfully applied to the identification of several arthropod families [7][8][9][10], including vectors of infectious diseases such as mosquitoes and ticks [11][12][13]. Correct classification of specimens requires MS spectra that are reproducible within a species and specific between species. Unlike molecular assays, MS protein profiles from the same arthropod specimen can vary according to body part, developmental stage or sample preparation mode [10,14,15,16]. Therefore, to compare and share the results of MS profiling analyses, the body parts selected for MS submission according to developmental stage and the conditions of sample homogenization were standardized for mosquitoes and ticks [8,17,18,19]. The legs and thorax for mosquitoes and the legs and half-idiosoma for ticks were the two body parts selected per specimen and submitted independently to MALDI-TOF MS for species identification [20][21][22][23]. Identification reliability was expressed as scores provided by algorithms matching MS spectra from the query sample against a library of reference MS spectra.
Despite the standardization of protocols, some MS spectra queried against the reference database (DB) failed to reach the threshold score value established for reliable identification [24,25]. This failure was generally attributed either to the absence of reference MS spectra for the corresponding species in the database or to the insufficient quality of the queried MS spectra (i.e., lower intensity and diversity of detected MS peaks). The low quality of MS spectra could be attributed to several factors, such as the mode or duration of sample storage, sample preparation, sample loading onto the MALDI plate, the quality of the matrix buffer or instrumental variations [24,26]. Currently, no pre-processing tests are integrated into the software proposed by manufacturers of MALDI-TOF MS profiling tools (e.g., Bruker Daltonics and Shimadzu) to assess the quality of MS spectra. The classification of MS spectra as being of low quality relies essentially on visual inspection of their profile and only concerns those that did not reach the threshold score value established for reliable identification [24,25]. This classification therefore depends on the reference DB and on the experimenter's skill [27], which can be highly subjective. The classification of an MS spectrum as low quality, or its exclusion, should be based on rigorous criteria and a reproducible method.
For a prospective adoption of MALDI-TOF MS profiling for arthropod identification by the entomological community, it is therefore imperative to provide integrated bio-informatics tools that distinguish compliant from unacceptable MS spectra. Accordingly, the objective of the present study was to create a bio-informatics tool, using the R environment [28], that helps to determine the quality of MS spectra rigorously before they are queried, and which integrates a complete MS spectra analysis pipeline including peak detection and MS spectra comparisons. In addition, it was essential to add metadata concerning the sample from which the spectra originate and to provide a user-friendly graphical interface using the R Shiny framework. The advantages offered by this bio-informatics tool, called MSProfileR, for analysing MS spectra of arthropods are presented and future developments are discussed.

General Organization of the "MSProfileR_v1.0" Tool
The MSProfileR tool is a Shiny application assessing the conformity of mass spectra profiles for future analysis. It was developed in R programming language version 4.3.2 and in the RStudio environment version 2023.10.31 (R Core Team, 2020: R Foundation for Statistical Computing, Vienna, Austria. https://www.r-project.org/ (accessed on 11 February 2024)). The general workflow of the application was divided into five major steps: (1) the data loading tool that includes spectra import, (2) the preprocessing tool including successive steps to control the conformity and quality of spectra, (3) the processing tool including peak detection and MS spectra classification, (4) the spectra annotation tool, which allows complementary metadata to be added to each spectrum, and finally (5) the output tool generating several files including a report (Figure 1). Each step of MSProfileR was composed of several modules, each performing one or more tasks. During the pipeline process, the user controls and adjusts the parameters of the MS spectra analysis. However, it is possible to run the entire workflow without any intervention by using default parameters or loading parameters defined in a previous analysis. To make MSProfileR easier to use, a graphical user interface (GUI) was created using the Shiny R package version 1.7.2, which allows interactive web applications to be built straight from R. Existing R packages, available in Comprehensive R Archive Network (CRAN) repositories, were wrapped. The different steps and modules of MSProfileR are described below.
Figure 1. MSProfileR workflow. MSProfileR was divided into five major parts (bold text): data loading, preprocessing, processing, annotation and output. Each part was organized in several modules (boxes with solid lines), processing one or more tasks in cascade and transferring intermediate results to the subsequent module until the final data are obtained. The user can select and adjust parameters throughout the pipeline process (boxes with dotted lines). The main results generated by the different modules are collected in the reporting modules, and the other outputs can be downloaded either separately or in a single file. (a) Validation of high-quality spectra and exclusion of non-compliant spectra. (b) Annotation of each spectrum by supplementing the metadata table. (c) Parameters from a previous analysis can be uploaded.

Data Loading
Data loading, which is the entry point to MSProfileR, contains a module for reading MS spectra, as well as an optional module for selecting and loading the setting parameters of a previous analysis.

Module of the MS Spectra Loading
Functions from the MALDIquant v.1.22.2 and MALDIquantForeign v.0.14.1 [29,30] packages were used to import MALDI-TOF spectra into R. These packages recognize MALDI-TOF mass spectra file types and load them automatically [29]. Different file formats (txt, tab, csv, fid, ciphergen, mzXML, mzML, imzML, analyze, cdf and msd) are detected automatically and can be uploaded from a folder containing one or many spectra. However, as only MS spectra from Bruker instruments were available, the assessment of the MSProfileR tool was conducted with these file formats. The spectra files are loaded recursively. For each spectrum, the m/z values and intensity values (x- and y-axis, respectively) are retrieved, together with several data points such as the spectrum name, the date of acquisition and the sample name. The sample name must be identical for all spectra of the biological replicates loaded by MSProfileR, because this parameter is used during preprocessing to automatically generate an average spectrum representative of the replicates originating from the same sample.
To load the spectra profiles correctly, the level number of each main folder path must be defined before loading the data. This level is the number of folders existing between the sampleName and the binary files (fid or aqu), which may differ between a direct classification of spectra by Bruker Daltonics software and a manual classification of spectra folders by the user when the sample name was not indicated before spectra acquisition. By default, the level of the spectra folders set on the MSProfileR front end is four, but the user can change it. Moreover, an ID number is attributed to each spectrum, which is useful to follow and display spectra throughout the pipeline.
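As an illustration of how a folder level can map a binary file to its sample name, the following is a minimal Python sketch, not MSProfileR code. It assumes that the level is counted as the number of path components to step up from the binary file; the function name and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def sample_name_from_path(binary_path: str, level: int = 4) -> str:
    """Return the folder found `level` components above the binary file.

    Assumption of this sketch: level 1 is the folder holding the file,
    level 2 its parent, and so on.
    """
    parts = PurePosixPath(binary_path).parts
    if level < 1 or level >= len(parts):
        raise ValueError("level does not fit the depth of the path")
    # parts[-1] is the binary file itself, so step up `level` components.
    return parts[-1 - level]


# Hypothetical Bruker-like layout: <run>/<sampleName>/<spot>/<n>/<method>/fid
path = "run_2024/Aedes_01/0_A1/1/1SLin/fid"
print(sample_name_from_path(path))           # folder four levels above fid
print(sample_name_from_path(path, level=3))  # a wrong level yields the spot
```

A wrong level shifts the sample name to another folder, which is the kind of shift a user would then correct by adjusting the level.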

Module of Parameter Loading
This module is used to upload all the parameters saved from MSProfileR in a previous analysis. The user can submit distinct datasets to MSProfileR while applying the same parameters, enabling reproducibility.

Preprocessing
The preprocessing step of the workflow is composed of several modules aiming to evaluate the quality of the spectra and to exclude inconsistent MS profiles (Table 1). All these steps rely on functions of the MALDIquant package [29], with the exception of the quality control step, which is based on the MALDIrppa package v.1.1.0-2 [31]. The MSProfileR tool assesses MS spectra quality through four successive steps: (1) trimming and conformity tests, (2) spectra cleaning, (3) quality control and visualization of potential outlier spectra and (4) replicate averaging. All these tasks use raw mass spectra data. Spectra passing these quality control steps are then selected for processing.

Module of Trimming and Conformity Check
The first task sets the lower and upper limits of the spectra m/z values using the trim function and inspects the conformity of the spectra. Unless otherwise specified, the default m/z limits correspond to a range from 2 to 20 kDa. The inspection of data conformity checks whether errors occurred during MS spectra data acquisition. It assesses three criteria:
i. Completeness: Are there any empty spectra (i.e., no data to load)?
ii. Missing values: Are there any spectra with irregular m/z values? Normally, the interval between two successive m/z values should remain equal or increase uniformly (i.e., no missing point or aberrant values).
iii. Spectra range: Do the lengths (i.e., m/z values range) of the spectra differ?
Empty or irregular spectra, or spectra whose m/z range differs too much, can compromise the subsequent steps and must therefore be detected. These steps control the consistency and conformity of the spectra. Non-compliant spectra are removed before continuing the analysis.
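The three conformity criteria can be sketched in a few lines of Python. This is a simplified illustration and not the MALDIquant implementation: the function name, the relative tolerance and the use of the most common number of data points as the expected length are all assumptions of this sketch.

```python
from collections import Counter


def check_conformity(spectra, rel_tol=1e-6):
    """Return the names of spectra failing each of the three checks.

    spectra: dict mapping a spectrum name to its ascending m/z values.
    """
    # (i) Completeness: spectra without any data point.
    empty = [name for name, mz in spectra.items() if len(mz) == 0]
    loaded = {name: mz for name, mz in spectra.items() if len(mz) > 0}

    # (ii) Missing values: successive m/z steps must be positive and
    # must never shrink (equal or increasing intervals).
    irregular = []
    for name, mz in loaded.items():
        steps = [b - a for a, b in zip(mz, mz[1:])]
        not_increasing = any(s <= 0 for s in steps)
        shrinking = any(cur < prev * (1 - rel_tol)
                        for prev, cur in zip(steps, steps[1:]))
        if not_increasing or shrinking:
            irregular.append(name)

    # (iii) Spectra range: flag spectra whose number of points differs
    # from the most common one in the dataset (assumption of this sketch).
    counts = Counter(len(mz) for mz in loaded.values())
    expected = counts.most_common(1)[0][0] if counts else 0
    odd_length = [n for n, mz in loaded.items() if len(mz) != expected]

    return {"empty": empty, "irregular": irregular, "length": odd_length}


example = {
    "ok1": [1.0, 2.0, 3.0, 4.0],
    "ok2": [1.0, 2.0, 3.0, 4.0],
    "gap": [1.0, 2.0, 4.0, 5.0],   # one point is missing
    "empty": [],
    "short": [1.0, 2.0, 3.0],
}
print(check_conformity(example))
```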

Module of Spectra Cleaning
The MS spectra that have passed the previous tests are then subjected to a series of transformations in order to standardize the data: (1) intensity transformation, (2) smoothing, (3) baseline correction and (4) normalization. The intensity transformation stabilizes the variance and improves the graphical visualization of spectra by reducing the scale effect that flattens low-intensity peaks. By default, this transformation is performed using the square root method [32], but three log transformations (log, log2 and log10) are also proposed [33,34].
Mass spectra typically contain a mixture of noise and signal. To increase the signal-to-noise ratio, smoothing algorithms are used to improve the measurement of peak intensity values and to facilitate peak detection. By default, the Savitzky-Golay filter is used to smooth the spectra with a half window size of 10, but the moving average is also available [35,36]. The noise altering the base level of mass spectra is then corrected with the Sensitive Nonlinear Iterative Peak (SNIP) algorithm (number of iterations, n = 100) [37][38][39] or with the TopHat, convex hull or median methods. Subtracting the baseline from the spectra facilitates profile comparison. Normalization is required in order to compare intensities at each m/z value between spectra. Three normalizations are available: the Total Ion Current (TIC) [40], the Probabilistic Quotient Normalization (PQN) [41] and the median methods. The TIC is applied by default to each spectrum and corresponds to setting the sum of the relative abundances of all the ions (m/z peaks) to one. All these transformations are based on the mass spectra processing functions (transformIntensity, smoothIntensity, removeBaseline and calibrateIntensity) of the MALDIquant package.
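To make the arithmetic of three of these transformations concrete, here is a minimal Python sketch of the square root transformation, a moving average smoother and a TIC normalization that scales intensities to sum to one. It is not the MALDIquant code; the function names are hypothetical, the window is simply truncated at the edges of the spectrum, and baseline correction (e.g., SNIP) is deliberately left out.

```python
import math


def sqrt_transform(intensity):
    """Square root transformation to stabilize the variance."""
    return [math.sqrt(v) for v in intensity]


def moving_average(intensity, half_window=1):
    """Smooth with a centred window, truncated at the spectrum edges."""
    n = len(intensity)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        smoothed.append(sum(intensity[lo:hi]) / (hi - lo))
    return smoothed


def tic_normalize(intensity):
    """Scale intensities so that they sum to one (total ion current)."""
    total = sum(intensity)
    if total == 0:
        return list(intensity)  # an all-zero spectrum is left unchanged
    return [v / total for v in intensity]


raw = [4.0, 9.0, 16.0, 9.0, 4.0]
cleaned = tic_normalize(moving_average(sqrt_transform(raw)))
print(cleaned)
```

The order used in the example (transformation, smoothing, normalization) follows the order given in the text, with the baseline step omitted.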

Module of Quality Control
The subsequent steps consist of detecting and filtering MS spectra considered as non-conform (i.e., outliers). To this end, the screenSpectra function from the MALDIrppa package is used. It computes an atypicality score (A score) for each spectrum. The calculation of this score is based on robust estimators [42,43]. The A score points out MS spectra whose peak intensity profiles diverge from the rest of the dataset. Upper and lower tolerance limits are returned at the same time by the function. MS spectra classified above the upper limit correspond to profiles of low intensity and poor resolution, whereas those below the lower limit have high peak intensity profiles [31]. To avoid the deletion of profiles harbouring high peak intensities, the lower threshold limit is not considered by default. Nevertheless, the user can reactivate it.
Several parameters of the screenSpectra function are available to select the estimator used in the calculation of the A score, the threshold and the method from which the cut-off is estimated. Two estimators are available, the median absolute deviation (MAD) and Q. The MAD is defined as the median of the absolute deviations from the median of the intensities. It is a robust estimator for fairly symmetric distributions. The Q estimator is an improved version of the MAD and is more efficient and adequate for non-symmetric distributions [42]. Here, the Q estimator was selected as the default parameter. Five methods (i.e., Rousseeuw and Croux (RC), Hampel, extreme studentized deviation (ESD), boxplot and adjusted boxplot) are available to compute the threshold limit for the detection of outliers [44]. By default, the RC method is selected.
The spectra with an A score above the upper limit are regarded as potentially faulty spectra. By default, the spectra detected as outliers are automatically rejected from further analysis. However, a graphical interface was added at this step, giving the user the possibility to visualize all MS spectra and notably the outliers. Based on this visualization, the user can refine the decision to exclude spectra near the A score limits or to keep some outliers. Generally, low-quality spectra are characterized by rippled and indented profiles due to low signal-to-noise intensities, which can be absorbed by the background noise. Following this step, all mass spectra whose profiles were considered atypical and confirmed by the user are counted and excluded from the next analysis steps.
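The idea of a robust upper tolerance limit can be illustrated with a short Python sketch. It does not reproduce the A score computation or the RC method of screenSpectra: it assumes that one score per spectrum is already available and applies a Hampel-style rule (median plus three scaled MADs), with hypothetical function names and example scores. Only the upper limit is applied, mirroring the default described above.

```python
from statistics import median


def mad(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def flag_upper_outliers(scores, k=3.0):
    """Return the indices of scores above median + k * 1.4826 * MAD.

    1.4826 is the usual consistency constant that makes the MAD
    comparable to a standard deviation for normally distributed data.
    """
    upper = median(scores) + k * 1.4826 * mad(scores)
    return [i for i, s in enumerate(scores) if s > upper]


# Hypothetical scores: the last spectrum diverges from the others.
scores = [1.0, 1.1, 0.9, 1.0, 1.2, 5.0]
print(flag_upper_outliers(scores))
```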

Module of Spectra Averaging
This module generates an archetype spectrum for MS spectra replicates. When replicates are available, those that passed the quality control steps are averaged. An average spectrum is calculated for each sample by applying the mean, sum or median method to its replicates, using the averageMassSpectra function from the MALDIquant R package. One archetype spectrum per sample is then analysed further. The absence of replicates does not affect the workflow of the MSProfileR application: a spectrum without replicates is used as the archetype profile of its sample. Samples with or without replicates can be loaded and analysed together.
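Replicate averaging by sample name can be sketched as follows in Python. This is an illustration rather than the averageMassSpectra implementation; it assumes that all replicates share the same m/z axis, and the function and sample names are hypothetical.

```python
from statistics import mean, median


def average_replicates(spectra, method="mean"):
    """Combine replicate intensities point by point for each sample.

    spectra: list of (sample_name, intensities) on a common m/z axis.
    """
    reducers = {"mean": mean, "median": median, "sum": sum}
    reduce = reducers[method]
    groups = {}
    for sample_name, intensities in spectra:
        groups.setdefault(sample_name, []).append(intensities)
    # zip(*replicates) walks the replicates one m/z position at a time.
    return {
        name: [reduce(column) for column in zip(*replicates)]
        for name, replicates in groups.items()
    }


spectra = [
    ("tick_01", [1.0, 2.0, 3.0]),
    ("tick_01", [3.0, 4.0, 5.0]),
    ("tick_02", [2.0, 2.0, 2.0]),  # no replicate: kept as its own archetype
]
print(average_replicates(spectra))
```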

Processing
The processing tool successively performs peak detection, spectra alignment, spectra visualization and classification, and the creation of an intensity matrix for further analyses.

Module of Peak Detection
This module aims to detect MS peaks, excluding background noise, for each spectrum, while taking into account the homogeneity of the number of MS peaks detected across the entire dataset. Peak detection is performed on the average spectra using the detectPeaks function from the MALDIquant package. It screens each spectrum with a sliding window (half window equal to 20 by default) in order to find the local maxima. A local maximum is considered a peak if its intensity is greater than the noise level, estimated using the MAD (default) or SuperSmoother algorithm, multiplied by a signal-to-noise ratio (SNR). The spectra profiles are thus discretized and only the detected peaks are kept, so that each spectrum profile can be regarded as a peak list. By default, the SNR value was set at 2, a good compromise for detecting as many peaks as possible without detecting too many spurious peaks belonging to the background noise. The SNR value is a parameter which can be adjusted by the user. To determine the optimal SNR value, the number of detected peaks and their standard deviation were measured for each SNR value tested, ranging from 2 to 7. A boxplot was generated showing the number of peaks detected per SNR value for the entire dataset. The SNR value for which the number of detected peaks was homogeneous and stable across the dataset was considered the optimal trade-off.
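The peak criterion described above (a local maximum within a sliding window whose intensity exceeds the noise level multiplied by the SNR) can be sketched in Python as follows. This is a simplified illustration, not the detectPeaks implementation: the noise is a single MAD computed over the whole spectrum and scaled by the 1.4826 consistency constant, which is an assumption of this sketch, and the function names are hypothetical.

```python
from statistics import median


def mad_noise(intensity):
    """Global noise estimate: scaled median absolute deviation."""
    center = median(intensity)
    return 1.4826 * median([abs(v - center) for v in intensity])


def detect_peaks(intensity, half_window=20, snr=2.0):
    """Return indices of local maxima that exceed snr * noise."""
    threshold = snr * mad_noise(intensity)
    n = len(intensity)
    peaks = []
    for i, value in enumerate(intensity):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        if value == max(intensity[lo:hi]) and value > threshold:
            peaks.append(i)
    return peaks


signal = [1, 3, 2, 4, 2, 30, 2, 4, 1, 3, 2]
print(detect_peaks(signal, half_window=2, snr=2))

# Number of peaks per SNR value, as in the SNR screening from 2 to 7.
for snr in range(2, 8):
    print(snr, len(detect_peaks(signal, half_window=1, snr=snr)))
```

Counting the peaks per SNR value, as in the loop, gives the quantity that the boxplot summarizes across a dataset.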

Module of Spectra Alignment
Prior to aligning the peaks detected from averaged spectra, reference peaks must be determined. As no spectrum was selected as reference, the referencePeaks function from the MALDIquant package was used. Under these conditions, all peaks from the dataset with an occurrence higher than the threshold (peak frequency), set by default at 0.9, are used as reference peaks for peak alignment, with a default tolerance of 0.002. The strict (default) or relaxed parameter determines whether all peaks or only the highest are considered.
The alignment of spectra was performed with the alignSpectra function of the MALDIquant R package. All the peaks detected among the average spectra per sample are aligned and calibrated using one of the phase correction (warping) functions: lowess, linear, quadratic or cubic. The lowess method was set as default.
After this alignment, peaks that are detected very close to each other but are not identical are binned into one, in accordance with an applied tolerance. This tolerance corresponds to the maximal relative deviation of a peak position (m/z value) to be considered as identical and is set to 0.002 by default. Thus, peaks whose positions (m/z values) differ by less than this threshold have their mass values equalized, forming one peak. Two binning functions are available: the strict one, for which bins never contain two or more peaks of the same sample, and the relaxed one, for which multiple peaks of the same sample are combined in one bin. By default, the strict binning function was selected.
The filtering step consists of removing peaks that are infrequently detected. A minimum frequency per detected m/z peak should be defined. For example, setting this parameter to 1 retains only the peaks common to all spectra. By default, this value was set at 0.2, corresponding to the elimination of all peak positions found in less than 20% of the dataset. The number of peaks detected is indicated, and the user can then adjust this frequency.
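The binning tolerance and the minimum frequency filter can be illustrated with the Python sketch below. It is deliberately simpler than the MALDIquant binning functions: peaks are grouped greedily in ascending m/z order around a running bin centre, without enforcing the strict rule of one peak per sample and bin. The function names and sample data are hypothetical.

```python
def bin_peaks(peak_lists, tolerance=0.002):
    """Group peaks whose relative deviation from the bin centre is small.

    peak_lists: dict mapping a sample name to its list of peak m/z values.
    Returns a list of (bin centre, set of samples having a peak in the bin).
    """
    tagged = sorted((mz, sample)
                    for sample, mzs in peak_lists.items() for mz in mzs)
    bins = []
    for mz, sample in tagged:
        if bins:
            members = bins[-1]
            center = sum(m for m, _ in members) / len(members)
            if abs(mz - center) / center <= tolerance:
                members.append((mz, sample))
                continue
        bins.append([(mz, sample)])
    return [(sum(m for m, _ in b) / len(b), {s for _, s in b}) for b in bins]


def filter_bins(bins, n_samples, min_frequency=0.2):
    """Keep bins observed in at least min_frequency of the samples."""
    return [(mz, samples) for mz, samples in bins
            if len(samples) / n_samples >= min_frequency]


peaks = {"a": [1000.0, 2000.0], "b": [1001.0, 3000.0], "c": [1000.5],
         "d": [999.5], "e": [1000.2]}
bins = bin_peaks(peaks)
print(len(bins), len(filter_bins(bins, n_samples=5, min_frequency=0.5)))
```

In this example the five peaks near m/z 1000 fall into one bin, while the peaks at 2000 and 3000 are each seen in a single sample and are removed once the minimum frequency is raised to 0.5.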

Module of Spectra Clustering
This module is used to visualize detected peaks and to order samples according to their profiles. An intensity matrix is generated, in which the average spectra per sample are represented and used for the classification. An unsupervised hierarchical clustering algorithm was then applied to generate a clustered heatmap, using the hclust function and the pheatmap package [45]. This package computes bootstrap values to indicate, for each query, the relevance of its position in the dendrogram. The columns correspond to the peak positions in m/z and the rows to the sample spectra. The peak intensity values are represented with a rainbow scale.

Spectra Annotation
Optionally, averaged spectra can be annotated by loading an annotation table. The information entered in this table for each sample enriches the metadata associated with each averaged spectrum. The annotation table tool is composed of one module. The module retrieves the list of sample names which passed the preprocessing and processing tests and places them directly in the first column, headed "sampleName", of a downloadable table (.xlsx file format). At this step, as all spectra replicates have already been averaged, a unique "sampleName" is associated with each averaged spectrum. In this way, the folder of each averaged spectrum can be easily paired with data from the annotation table using the "sampleName". Once the user considers the added information sufficient, the annotated table is imported into the MSProfileR tool. The module checks whether all averaged spectra were annotated and whether some annotations were not paired with averaged spectra. Although the importation of the annotation table is not mandatory, these annotations can be helpful for future comparisons of mass spectra profiles. This module uses the readxl and writexl packages to read and write .xlsx files.
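The pairing check between averaged spectra and annotation rows amounts to two set differences on the "sampleName" key. The following Python sketch illustrates it with hypothetical function and sample names; reading the .xlsx file is left out.

```python
def check_annotation_pairing(spectrum_names, annotation_rows):
    """Compare sample names of averaged spectra with annotated names.

    annotation_rows: list of dicts, each holding a "sampleName" key.
    """
    spectra = set(spectrum_names)
    annotated = {row["sampleName"] for row in annotation_rows}
    return {
        "not_annotated": sorted(spectra - annotated),
        "unpaired_annotations": sorted(annotated - spectra),
    }


rows = [
    {"sampleName": "tick_01", "species": "Ixodes ricinus"},
    {"sampleName": "tick_03", "species": "Ixodes ricinus"},
]
print(check_annotation_pairing(["tick_01", "tick_02"], rows))
```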

Output
The output module is created to guarantee traceability for the user for all the process steps, the methods and parameters selected, but also to download and to save data that can be useful for upcoming analyses.

Reports
The report is implemented in R Markdown language [46], offering an interface to generate a downloadable document in pdf format that includes information about the steps performed, the methods selected and the parameters applied. It also contains information about the spectra dataset, such as the name of the directory, the number of spectra loaded and the results of the pipeline analysis, notably the outcome of the conformity tests and quality control steps with the list of excluded spectra (non-conform or outliers). In addition, all representations, such as tables, plots, gel views and the clustered heatmap, are included in the report.

Save Setting Parameters
This module uses the rjson package to save, in JSON format, all the methods chosen and their associated parameter settings, which can then be downloaded and reused in a future analysis.
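Saving and reloading parameters as JSON can be sketched as below in Python. The key names are hypothetical and do not reflect the file layout written by MSProfileR; the values are the defaults stated in the text.

```python
import json

# Default settings quoted in the text; the key names are assumptions.
params = {
    "trim": {"lower_kda": 2, "upper_kda": 20},
    "transformation": "sqrt",
    "smoothing": {"method": "SavitzkyGolay", "half_window": 10},
    "baseline": {"method": "SNIP", "iterations": 100},
    "normalization": "TIC",
    "quality_control": {"estimator": "Q", "method": "RC",
                        "use_lower_limit": False},
    "peak_detection": {"noise": "MAD", "snr": 2, "half_window": 20},
    "alignment": {"min_frequency": 0.9, "tolerance": 0.002,
                  "warping": "lowess"},
    "binning": {"method": "strict", "tolerance": 0.002},
    "filtering": {"min_frequency": 0.2},
}

saved = json.dumps(params, indent=2)   # text that would be written to disk
reloaded = json.loads(saved)           # parameters for a later analysis
print(reloaded == params)
```

Because the reloaded dictionary is identical to the saved one, a second dataset can be processed with exactly the same settings.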

Export Intensity Matrix
This module exports the intensity matrix to a CSV file, in which the sample names are given in rows and the m/z values in columns. It also stores in an HDF5 file the four main outputs of the analysis: the parameters selected, the averaged spectra, the intensity matrix and the annotation table. HDF5 is an open-source format that can store and manage large, heterogeneous and complex data. This file is created using the hdf5r library [47].
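The layout of the exported matrix (one row per sample, one column per m/z value) can be illustrated with this Python sketch. It is not the export code of MSProfileR: the function name is hypothetical, and leaving a cell empty when a sample has no peak at a given m/z is an assumption of this sketch.

```python
import csv
import io


def intensity_matrix_csv(peaks):
    """Build CSV text with samples in rows and m/z values in columns.

    peaks: dict mapping a sample name to a dict {m/z value: intensity}.
    """
    mz_values = sorted({mz for sample in peaks.values() for mz in sample})
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["sampleName"] + mz_values)
    for name in sorted(peaks):
        # A missing peak is written as an empty cell.
        writer.writerow([name] + [peaks[name].get(mz, "") for mz in mz_values])
    return buffer.getvalue()


matrix = {"s1": {1000.0: 0.5, 2000.0: 0.1}, "s2": {1000.0: 0.4}}
print(intensity_matrix_csv(matrix))
```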

Module Downloading All Files
This last module was created so that the user could download all files listed in the five preceding modules. It is then possible to download either some particular data points or all the data generated by the analysis.

The User Interface (UI)
For users unfamiliar with the R language, a user-friendly interactive web interface was created, making it possible to exchange information following a user-machine interaction. The UI or graphical interface was based on the Shiny dashboard template (https://rstudio.github.io/shinydashboard/; accessed on 30 November 2021), using Shiny, a "package that makes it easy to build interactive web apps straight from R and Python" (https://rdrr.io/cran/shiny/; accessed on 29 May 2024). The UI was compartmentalized in five tabs corresponding to the five steps of the workflow: (1) Data loading, (2) Preprocessing, (3) Processing, (4) Annotation and (5) Output. The architecture of the MSProfileR tool was organized in a modular way (Figure 2).

Web Interface Development Architecture
All R package dependencies used for the creation of the MSProfileR tool are available in Bioconductor (https://www.bioconductor.org/; accessed on 1 May 2024) or Comprehensive R Archive Network (CRAN) repositories (https://cran.r-project.org/; accessed on 24 April 2024). The MSProfileR modules are executed in cascade, with a modularity that allows new functionalities and tasks to be inserted without rewriting all the R code. MSProfileR was designed to be used by everyone, especially by individuals who have no programming skills. Installing the "MSProfileR" package in the R environment (version ≥ 4.3.0) installs the R packages used to build the tool and loads all the dependencies needed to launch the application and open the front end of the interface. The application launches in a web browser.

UI of Data Loading Tab
The interface of the data loading tab was divided into two panels corresponding to the different modules from this tab (Supplementary Figure S1, Table 1). The MSProfileR tool starts with the loading of spectra. The level number of the binary files path, set at four by default, should not be modified before spectra loading. The user selects the directory containing one or many spectra to analyse. The data related to each spectrum are gathered into a summary table, including the spectra ID (a number generated programmatically for each spectrum), the sample name, the replicate name and the acquisition date. The sample name is used to detect sample replicates when available. The table displays ten spectra per page, sorted by loading order; the others are accessible through the page selection at the bottom of the table. The number of loaded spectra and the path are indicated at the top of the table. Using the table, the user can check whether the sampleName and replicateName columns are correctly filled. If a shift is noticed, it can be rectified by adjusting the level number of the binary files path until the correct classification is obtained.
The second panel is optional. At this step, the user can upload the parameter settings used in a previous analysis. This option allows the user to analyse the current spectrum dataset with the same parameters applied previously to another dataset.

UI of Preprocessing Tab
The preprocessing tool is represented on the interface by five panels corresponding to the four modules of this tab (Supplementary Figure S2, Table 1). It consists of successive steps that evaluate the quality of the spectra in order to exclude inconsistent MS profiles. The quality control module was divided into two panels to distinguish the automatic selection process from the optional manual validation of conforming spectra. All these steps are interactive: the user can select a method and, when available, its associated parameters for dataset analysis.
The first panel consists of trimming and testing the conformity of spectra (Supplementary Figure S2). The lower and upper limits of the spectra range can be set by the user and are displayed on a graduated m/z scale (in kilodaltons, kDa). Following this step, conformity tests are automatically conducted and the results are presented in a table. The table rows are coloured in green if all spectra are compliant and in grey if some spectra did not pass the test(s). Moreover, the number of spectra considered as non-conforming per test is indicated, and these spectra are automatically excluded from the rest of the analysis. Nevertheless, the user can keep these non-conforming spectra by deselecting the checkbox entitled "Exclusion of non-conform MS spectra".
The cleaning panel performs successive steps to adjust spectra profiles, including transformation, smoothing and normalization of intensity plus a baseline correction. For each step, the user can select among different methods with a radio button and visualize the effect of the treatment on each spectrum in a plot through a reactive window.
The spectra quality control panel displays the results of the spectra screening, according to the selected method, in a plot of the atypical score (A score) of each spectrum, indexed by a number corresponding to its loading order (i.e., spectra ID). The dotted line represents the upper limit of the A score threshold above which spectra are considered as outliers. Numbers corresponding to atypical spectra are coloured in red, whereas those kept for the rest of the analysis are indicated in blue. It is also possible to flag spectra as outliers because of their low A score by deselecting the checkbox "Include spectra below the lower threshold". In that case, all spectra with an A score outside the lower or upper limits are considered atypical. Nevertheless, in our experience, notably with MS spectra of entomological origin, protein profiles with a very low A score generally correspond to MS spectra with higher peak intensities. To avoid classifying them as outliers, only the upper limit is applied by default.
The classification of spectra as conforming or atypical can still be validated manually through the selection panel. This panel is composed of two boxes separating spectra according to their compliance status. One box lists the outliers (i.e., "Atypical spectra"), whereas the second box lists spectra kept for the next steps (i.e., "Selected spectra"). Spectra in each box are identified by their spectra ID, sample name and A score. From these lists, the user can choose a spectrum, which is plotted below. By default, atypical spectra are automatically removed from the rest of the analysis. However, after visual checking, the user can decide to keep some "atypical spectra" or to remove some "selected spectra" by moving them from one box to the other using arrow buttons.
The last panel of this tab consists of averaging the spectra replicates which passed all the preprocessing steps. The user can select one of three methods to perform the spectra averaging. Averaged spectra are presented in a table containing one representative spectrum per replicated sample, classified by sample name. The number of averaged spectra is indicated at the top of the table. This averaging reduces the number of spectra to analyse in the processing part.
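The trimming and conformity steps can be sketched in Python as follows. The three checks shown (non-empty signal, matching vector lengths, strictly increasing m/z values) are assumptions chosen for illustration; the tests actually implemented in MSProfileR may differ, and the function names are hypothetical.

```python
def trim(mz, intensity, lower=2000.0, upper=20000.0):
    """Keep only the points whose m/z lies within [lower, upper] (in Da)."""
    kept = [(m, i) for m, i in zip(mz, intensity) if lower <= m <= upper]
    return [m for m, _ in kept], [i for _, i in kept]

def conformity(mz, intensity):
    """Run illustrative conformity checks and return one boolean per test."""
    return {
        "same_length": len(mz) == len(intensity),
        "not_empty": len(mz) > 0 and any(i > 0 for i in intensity),
        "increasing_mz": all(a < b for a, b in zip(mz, mz[1:])),
    }

def is_conform(mz, intensity):
    """A spectrum is kept only if it passes every test."""
    return all(conformity(mz, intensity).values())

mz, y = trim([1500.0, 3000.0, 4000.0, 25000.0], [9.0, 2.0, 5.0, 1.0])
print(mz, y, is_conform(mz, y))
```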
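As an illustration of the cleaning steps, the following Python sketch chains a square root transformation, a basic SNIP-type baseline estimate and a total ion current (TIC) normalization. It is a simplified stand-in written for this text, not the MALDIquant implementation called by MSProfileR: smoothing is omitted, the SNIP variant is the plain increasing-window form, and TIC is approximated by the sum of intensities.

```python
import math

def sqrt_transform(y):
    """Variance-stabilizing transformation of intensities."""
    return [math.sqrt(v) for v in y]

def snip_baseline(y, iterations=100):
    """Basic SNIP: iteratively clip each point to the mean of its neighbours
    at distance p, for p = 1..iterations."""
    base = list(y)
    n = len(base)
    for p in range(1, iterations + 1):
        prev = list(base)
        for i in range(p, n - p):
            base[i] = min(prev[i], (prev[i - p] + prev[i + p]) / 2)
    return base

def tic_normalize(y):
    """Divide by the summed intensity (simplified TIC)."""
    total = sum(y)
    return [v / total for v in y] if total > 0 else list(y)

def clean(y, iterations=100):
    y = sqrt_transform(y)
    base = snip_baseline(y, iterations)
    corrected = [max(v - b, 0.0) for v, b in zip(y, base)]
    return tic_normalize(corrected)

# A flat background of 4 with one peak of 100
print(clean([4, 4, 4, 100, 4, 4, 4]))
```

In this toy spectrum, the constant background is entirely removed by the baseline correction, so the whole normalized signal is carried by the single peak.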
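The thresholding logic of this panel can be sketched as follows. For simplicity, the limits are computed here as the median plus or minus a multiple of the median absolute deviation (MAD); this is a substitute chosen for illustration and not the Q estimator with the RC rule used in the analyses reported below. The function names and the `keep_low_scores` argument, which mimics the "Include spectra below the lower threshold" checkbox, are hypothetical.

```python
import statistics

def score_limits(scores, threshold=3.0):
    """Return (lower, upper) limits as median -/+ threshold * scaled MAD."""
    med = statistics.median(scores)
    mad = statistics.median([abs(s - med) for s in scores])
    spread = threshold * 1.4826 * mad
    return med - spread, med + spread

def flag_outliers(scores, threshold=3.0, keep_low_scores=True):
    """Return the positions (loading order) of spectra flagged as atypical.

    keep_low_scores=True applies only the upper limit (default behaviour
    described in the text); False applies both limits.
    """
    lower, upper = score_limits(scores, threshold)
    flagged = []
    for idx, s in enumerate(scores):
        if s > upper or (not keep_low_scores and s < lower):
            flagged.append(idx)
    return flagged

a_scores = [0.30, 0.32, 0.31, 0.29, 0.33, 0.90, 0.05]
print(flag_outliers(a_scores))                         # upper limit only
print(flag_outliers(a_scores, keep_low_scores=False))  # both limits
```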
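Averaging by the mean can be sketched as below, assuming that all replicates of a sample share the same m/z grid (an assumption of this illustration) and using a hypothetical function name. Replicates are grouped by sample name and their intensities are averaged point by point.

```python
from collections import defaultdict

def average_replicates(spectra):
    """spectra: list of (sample_name, intensities) on a shared m/z grid.
    Returns {sample_name: mean intensities}, sorted by sample name."""
    groups = defaultdict(list)
    for name, intensities in spectra:
        groups[name].append(intensities)
    return {
        name: [sum(col) / len(col) for col in zip(*replicates)]
        for name, replicates in sorted(groups.items())
    }

retained = [
    ("sample_A", [1.0, 3.0]),
    ("sample_A", [3.0, 5.0]),
    ("sample_B", [2.0, 2.0]),  # only one replicate left after quality control
]
print(average_replicates(retained))
```

A sample is lost only when all of its replicates have been excluded beforehand, since one retained replicate is enough to produce an averaged spectrum.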

UI of Processing Tab
The interface of the processing tab is separated into five panels (Supplementary Figure S3). The first panel concerns peak detection. The user can select a noise estimation method and visualize the results of peak detection on a boxplot showing the variation in the number of peaks detected per spectrum in the dataset for each SNR value from 2 to 7.
Based on this result, the user chooses the optimal SNR value for the current analysis using a graduated slider. A visualization of the peaks detected for each averaged spectrum is available. On the graph, the detected peaks are numbered from the highest to the lowest intensity.
The next three panels correspond to the spectrum alignment modules, which were split in the interface to visualize and control the outcome of each step. Firstly, spectrum alignment requires reference peaks. The user should therefore select a method to obtain reference peaks from the spectra dataset based on two parameters, the minimum peak occurrence (i.e., frequency) and the alignment tolerance. The number of peaks used as reference and their distribution over the range of m/z values are presented on a plot. Before applying spectra alignment, the user can still adjust the parameters to increase the number of reference peaks. Then, the user launches the alignment process by choosing one of the four warping function methods. The alignment is visualized in a gelview where the abscissa and ordinate correspond to the peak position (m/z) and the averaged spectra, respectively. Peak intensity is represented by a grey scale. Once the peaks are aligned, the next two panels concern peak binning and peak filtering. For these two steps, different methods and parameters are offered to the user to reduce the number of peaks retained for the creation of the matrix of peak intensities. The results of the binning and filtering steps are illustrated by gelviews in the UI, and the total number of peaks (m/z values) in the dataset is also indicated at each step. Based on this information, the user can decide either to adjust parameters or to continue the analysis.
The last panel consists of classifying the averaged spectra according to their profiles. Hierarchical clustering is performed and represented by a heatmap linked to a dendrogram. At this stage of the application, an intensity matrix for all the detected peaks is generated for future analysis. Each peak does not necessarily exist in all the spectra. By default, when peaks are missing, intensity values of the spectra are used to fill the intensity matrix. A checkbox allows this filling to be deactivated. In this case, when peaks are missing, NA values are inserted into the intensity matrix. In the heatmap, missing values appear in grey, whereas the detected peaks are coloured on a rainbow scale according to their intensity values.
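The dependence of peak detection on the SNR value can be illustrated with a short Python sketch. A point is reported as a peak when it is the maximum within a window of `half_window` points on each side and exceeds `snr` times a noise level estimated by the scaled MAD of the whole spectrum. This is a simplified illustration with hypothetical function names, not the routine called by MSProfileR.

```python
import statistics

def mad_noise(y):
    """Noise level estimated as the scaled median absolute deviation."""
    med = statistics.median(y)
    return 1.4826 * statistics.median([abs(v - med) for v in y])

def detect_peaks(mz, y, snr=3.0, half_window=20):
    """Return (m/z, intensity) of local maxima above snr * noise."""
    noise = mad_noise(y)
    peaks = []
    for i, v in enumerate(y):
        lo, hi = max(0, i - half_window), min(len(y), i + half_window + 1)
        if v == max(y[lo:hi]) and v > snr * noise:
            peaks.append((mz[i], v))
    return peaks

mz = list(range(2000, 2015))
y = [2, 4, 3, 5, 2, 40, 3, 4, 2, 5, 3, 16, 4, 2, 3]
for snr in (3, 12):
    # The number of retained peaks decreases as the SNR value increases
    print(snr, detect_peaks(mz, y, snr=snr, half_window=2))
```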
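The effect of binning and filtering on the number of retained peak positions can be sketched as follows. The binning shown here is a simple greedy grouping with a relative tolerance, which is only an approximation of the methods offered in the interface, and all function names are hypothetical. Filtering then keeps the positions found in at least a minimum fraction of the spectra.

```python
def bin_peaks(peak_lists, tolerance=0.002):
    """peak_lists: one list of m/z values per spectrum.
    Returns a list of (bin centre, set of spectrum indices)."""
    tagged = sorted((m, s) for s, peaks in enumerate(peak_lists) for m in peaks)
    bins = []  # each bin is ([m/z values], {spectrum indices})
    for m, s in tagged:
        if bins:
            masses, members = bins[-1]
            centre = sum(masses) / len(masses)
            if abs(m - centre) / centre <= tolerance:
                masses.append(m)
                members.add(s)
                continue
        bins.append(([m], {s}))
    return [(sum(ms) / len(ms), members) for ms, members in bins]

def filter_bins(bins, n_spectra, min_frequency=0.2):
    """Keep the bin centres present in at least min_frequency of the spectra."""
    return [c for c, members in bins if len(members) / n_spectra >= min_frequency]

peak_lists = [[3000.0, 5000.0], [3003.0, 5004.0], [3001.0], [3002.0], [3000.5, 7000.0]]
bins = bin_peaks(peak_lists)
print(len(bins), filter_bins(bins, n_spectra=len(peak_lists), min_frequency=0.5))
```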
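The construction of the intensity matrix with missing values can be sketched as below. Only the option in which filling is deactivated is illustrated, with `None` standing for NA; the default filling from the spectrum intensities is not reproduced. The function name and the matching rule (relative tolerance around each retained position) are assumptions of this sketch.

```python
def intensity_matrix(peak_tables, positions, tolerance=0.002):
    """peak_tables: {sample: [(m/z, intensity), ...]}.
    positions: retained m/z values (columns).
    Returns {sample: row}, with None where no peak matches a position."""
    matrix = {}
    for sample, peaks in peak_tables.items():
        row = []
        for pos in positions:
            hits = [i for m, i in peaks if abs(m - pos) / pos <= tolerance]
            row.append(max(hits) if hits else None)
        matrix[sample] = row
    return matrix

peak_tables = {
    "leg_01": [(3001.0, 12.0), (5002.0, 4.0)],
    "leg_02": [(3000.0, 9.0)],
}
print(intensity_matrix(peak_tables, positions=[3000.5, 5002.0]))
```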

UI of Annotation Tab
This tab is divided into three panels and aims to add sample information (metadata) to each averaged spectrum (Supplementary Figure S4). An annotation table (.xlsx) generated automatically by MSProfileR is downloadable in the first panel. Only the first column of the annotation table, "sampleName", is filled, with the names of the samples which passed all previous tests. The information of interest is entered in the subsequent columns. The body part used, the arthropod family, its geographical origin, and the method and mode of storage are some examples of added information which could be useful for the next steps of the analysis. Once the annotation table is enriched with metadata, it can be uploaded using the second panel of this tab. Finally, in the annotation processing panel, the success of the annotation and of its pairing with averaged MS spectra can be checked in a summary table generated automatically.
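The pairing between averaged spectra and the annotation table can be sketched as a simple lookup on the "sampleName" column. The function name and the metadata columns used in the example are hypothetical.

```python
def pair_annotations(sample_names, annotation_rows):
    """Pair each averaged spectrum with its metadata row via 'sampleName'.
    Returns (paired, unpaired sample names)."""
    by_name = {row["sampleName"]: row for row in annotation_rows}
    paired = {n: by_name[n] for n in sample_names if n in by_name}
    unpaired = [n for n in sample_names if n not in by_name]
    return paired, unpaired

averaged = ["s1", "s2", "s3"]
table = [
    {"sampleName": "s1", "family": "Culicidae", "bodyPart": "legs"},
    {"sampleName": "s3", "family": "Ixodidae", "bodyPart": "legs"},
]
paired, unpaired = pair_annotations(averaged, table)
print(len(paired), unpaired)
```

The annotation is complete when the list of unpaired samples is empty and the number of paired rows equals the number of averaged spectra.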

UI of Output Tab
The output tab consists of six panels (Supplementary Figure S5). The first one serves to download the report generated during the analysis. The next four panels allow users to download the parameters applied, the intensity matrix, the figures and the HDF5 files, which include the parameters, the averaged spectra, the intensity matrix and the annotated table. All these outputs can be downloaded as a single archive in the last panel, named "All files".

Assessment of MSProfileR Tool
Although MSProfileR can be applied beyond the entomological domain, its performance was assessed here with two datasets from arthropods. The two datasets of MS spectra were very different from each other, both in terms of sample diversity and in terms of the number of spectra tested. The first dataset was intentionally heterogeneous and included several distinct arthropod families. The second consisted of a single mosquito genus but included 13 species and a larger number of spectra.

Use Case N°1: Arthropod Families
This first dataset is composed of several arthropod species, all laboratory reared. It comprises three arthropod families, mosquitoes (Culicidae), ticks (Ixodidae) and fleas (Pulicidae), including six species, some of them at two developmental stages. According to family and developmental stage, different body parts were submitted to MS analysis. Several recent papers established standardized operational protocols for sample preparation of these arthropod families [17,18,22]. Details about the arthropod species used for MS spectra analysis are summarized in Table 2. The breeding conditions and sample preparations are also indicated in this table, with respective references. This first dataset consists in total of 192 MS spectra coming from 48 samples, analysed in quadruplicate. These spectra come from our home-made reference spectra DB [48] and are considered as high quality (Supplementary File S1). The 192 MS spectra were uploaded in MSProfileR. The number of files uploaded and the names of samples can be checked on the "data loading" tab (Supplementary Figure S1). As it is the first dataset of spectra analysed with this tool, no file containing MSProfileR parameters set in a previous analysis was available. Therefore, all the parameters were set during the present analysis.
In the preprocessing tab (Supplementary Figure S2A), MS spectra were trimmed in the range of 2-20 kDa and were submitted to conformity tests (Supplementary Figure S2B,C). All spectra passed these tests. The parameters selected in the spectra cleaning panel were the following: square root method for transforming intensity, Savitzky-Golay method with a half-window size of 10 for smoothing intensity, SNIP method with 100 iterations for removing the baseline and the TIC method for normalizing the intensity of MS spectra. The result of the cleaning steps can be visualized on a plot for each spectrum (Supplementary Figure S2D). The Q estimator and the RC method with a threshold of 3 were used for the detection of outliers in the quality control panel (Supplementary Figure S2E). In the present dataset, an upper atypical score (A score) limit of 0.59 revealed six spectra among 192 as outliers (3.1%). Interestingly, unticking the "Include spectra below the lower threshold" button did not exclude additional spectra. This underlines that none of the spectra with a low A score were considered outliers. All spectra classified as outliers are automatically removed from the next steps of the analysis. However, it remains possible to visualize all outliers and keep them (Supplementary Figure S2F). Using MSProfileR, all outlier spectra were compared with their respective replicates (Supplementary Figure S6). These spectra with a higher A score (i.e., above the limit) were clearly confirmed to be of lower quality, particularly because they presented a higher background noise. These spectra came from five distinct samples: two replicates from one Ae. albopictus at larval stage, two spectra from the legs of two Ae. aegypti specimens and two spectra from cephalothoraxes of two Ct. felis specimens. As no sample had four replicates considered outliers, removing the six outlier spectra did not suppress any sample, which was confirmed in the averaging panel (Supplementary Figure S2G). The averaging of selected replicates was conducted with the mean method, revealing that 48 samples were available for further analysis.
In the processing tab (Supplementary Figure S3A), the MAD method with a half-window of 20 was applied for peak detection. Peak detection is directly linked to the signal-to-noise ratio (SNR) value selected. In this dataset, when the SNR increases from 2 to 7, the mean number of peaks detected per spectrum decreases from about 144 to 39 (Supplementary Figure S3B). The choice of the optimal SNR value could be decisive for the next steps of the analysis. To determine the most appropriate SNR value, the peak lists detected per arthropod family and per body part were compared for SNR values from two to five (Supplementary Figure S7). At an SNR equal to two, numerous peaks were detected between 2 and 8 kDa, among which several are unspecific and correspond to background noise. At an SNR equal to three, the majority of these peaks of very low intensity were removed. At an SNR equal to four, regardless of the sample type, some peaks which do not correspond to background noise were no longer detected. In such conditions, some information about the protein profile of the sample was lost. To avoid this loss, an SNR value of three was selected for this dataset. At this SNR value, the number of peaks detected per spectrum was 103.3 ± 11.5 (mean ± standard deviation (SD)).
After peak detection, the alignment of spectra was conducted (Supplementary Figure S3C). The strict method with a minimum frequency of 50% and a tolerance of 0.002 was set for selecting reference peaks. A total of 33 peaks met these criteria and were used to align spectra from the dataset by applying the lowess method for spectra warping. For the binning step, the strict method with a tolerance of 0.002 was set, and the parameter for peak filtering was a prevalence of at least 20% for an m/z value to be conserved (Supplementary Figure S3D,E). Among the 516 peak positions obtained after the binning step, the filtering reduced this number to 202. Finally, these 202 peaks were used to classify each sample based on its averaged MS profile using hierarchical clustering (Supplementary Figure S3F). The clustering confirmed that all averaged spectra from the same sample type were grouped per species and body part. Interestingly, the first criterion of classification is the body part, followed by the species (Supplementary Figure S8).
In the annotation tab (Supplementary Figure S4A), the list of the sample names of the 48 averaged spectra can be downloaded; it was completed with metadata from each sample prior to being uploaded (Supplementary Figure S4B,C). The annotation processing table allows users to verify the correspondence of the averaged spectra with the metadata (Supplementary Figure S4D). Here, as the numbers of averaged spectra and annotated samples are identical (n = 48) and as none of them was unpaired, all averaged spectra from dataset N°1 were annotated (Supplementary File S1).
In the output tab (Supplementary Figure S5A,(a)), several kinds of files could be downloaded. The report file detailed the modules, methods and parameters applied, plus information about dataset N°1 and the results (tables, plots, boxplot, gelviews and clustered heatmap) of the preprocessing and processing parts (Supplementary Figure S5B,(b)). The parameters can be downloaded as a JSON file and reused for analysing a new dataset (Supplementary Figure S5C,(c)). This also ensures the traceability and the reproducibility of the analysis. The plots generated by the application during the analysis of dataset N°1 can also be exported (Supplementary Figure S5E,(e)), as well as the intensity matrix (Supplementary Figure S5D,(d)) and the HDF5 files (Supplementary Figure S5F,(f)). All the outputs from this dataset are downloadable in one zipped file which is available in Supplementary File S2.
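The reuse of parameters through a JSON file can be sketched in Python as follows. The key names are hypothetical and do not reflect the actual structure of the file written by MSProfileR; the values are those reported above for dataset N°1.

```python
import json

# Key names are hypothetical; values are those reported for the first dataset.
params = {
    "trim_range_kda": [2, 20],
    "transform": "sqrt",
    "smoothing": {"method": "SavitzkyGolay", "half_window": 10},
    "baseline": {"method": "SNIP", "iterations": 100},
    "normalization": "TIC",
    "quality_control": {"estimator": "Q", "method": "RC", "threshold": 3},
    "averaging": "mean",
    "peak_detection": {"noise": "MAD", "half_window": 20, "snr": 3},
    "reference_peaks": {"method": "strict", "min_frequency": 0.5, "tolerance": 0.002},
    "warping": "lowess",
    "binning": {"method": "strict", "tolerance": 0.002},
    "filtering": {"min_frequency": 0.2},
}

def save_params(p):
    """Serialize the parameters to a JSON string (written to a file in practice)."""
    return json.dumps(p, indent=2, sort_keys=True)

def load_params(text):
    """Read the parameters back from a JSON string."""
    return json.loads(text)

# Reload the settings for a new dataset, then adjust a single parameter
reloaded = load_params(save_params(params))
reloaded["quality_control"]["threshold"] = 1.5
print(reloaded["quality_control"])
```

Storing every method and parameter in one file means that a later analysis differs from the recorded one only by the values that were explicitly changed.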

Use Case N°2: Culex Genus
This second dataset included MS spectra from Neotropical Culex mosquitoes (Diptera: Culicidae) coming from a recently published work [53]. The methods and protocols used for mosquito collection, for their morphological and/or molecular identification, for sample preparation and MS submission were detailed in this previous study [53]. Briefly, this dataset was composed of MS spectra from legs and thoraxes of 13 distinct Culex species collected in the field, in French Guiana. The number of specimens per species varied according to the availability of the sample, ranging from 1 to 34 (Table 3, Supplementary File S3). Dataset N°2 has interesting characteristics for the evaluation of MSProfileR. First, it includes cryptic species that are morphologically indistinguishable. Secondly, specimens were collected in the field, so higher inter-sample variations could have occurred among specimens from the same species compared to laboratory-reared specimens. Thirdly, it encompasses a large number of samples: 169 mosquito specimens submitted to MALDI-TOF MS on two distinct body parts (legs and thoraxes) and loaded in quadruplicate, corresponding to a total of 1352 spectra (169 specimens × two body parts × four replicates). Finally, in the previous work using the same dataset [53], some spectra considered as low quality were excluded by the authors based on visual inspection. The MSProfileR tool now offers the opportunity to detect and visualize all spectra considered as outliers and to exclude them automatically from the dataset. Comparing these outliers with the list of spectra excluded by the experimenters from the same dataset in the previous work will allow us to assess the relevance of the MSProfileR tool.
For analysing dataset N°2, the parameters applied for dataset N°1 were uploaded. Uniquely, the modified parameters compared to dataset N°1 were specified. In the preprocessing part, all the 1352 spectra passed the conformity tests. However, the quality control steps revealed that 65 spectra (4.8%) exceeded the upper limit of the atypical threshold (A score = 0.72) and were considered as outliers (Figure 3D). Interestingly, all spectra classified as outliers had leg origin (Figure 3A, Supplementary File S4).
Several studies reported a higher diversity of spectra between body parts than between species [20,54]. Moreover, spectra from thoraxes generally had more numerous and higher intensity peaks than the paired leg spectra [22,53]. These two factors could likely explain why numerous spectra originating from legs were classified as outliers. To avoid this bias of exclusion, dataset N°2 was analysed by splitting the list of spectra per body part, legs or thoraxes.
By applying the same parameters, "A" score thresholds of 0.93 and 0.59 were obtained for leg and thorax spectra, respectively (see report in Supplementary Files S5 and S6). A total of 9 and 20 spectra from legs and thoraxes were classified as outliers, respectively. The inspection of these spectra confirmed the low quality of these outlier profiles compared to typical spectra. However, some spectra with a low-quality profile now scored below the threshold (Figure 3E). The threshold parameter of the quality control step was therefore adjusted to 1.5 in order to exclude these spectra of low quality. "A" score thresholds of 0.73 and 0.52 were then obtained for leg (Figure 3B) and thorax (Figure 3C) spectra, respectively. Details of the spectra listed as outliers are available in the reports of the leg (Supplementary File S7) and thorax datasets (Supplementary File S8). Although 62 (9.2%) and 70 (10.4%) spectra from leg and thorax samples were classified as outliers (Figure 3D), the number of samples excluded remains modest, less than 5%. Indeed, the four spectra replicates from the legs and thoraxes of eight and six specimens, respectively, were excluded from the analysis. It is interesting to note that the eight specimens for which leg spectra were classified as outliers encompassed those from the two Culex idottus samples which were excluded by the authors due to the low intensity and inter-sample heterogeneity of MS profiles in the previous work [53]. The inspection of the other spectra for which the four replicates were classified as outliers from legs or thoraxes confirmed the lower quality of these protein profiles. These samples were then excluded from the analysis. A total of 161 and 163 averaged spectra from legs and thoraxes passed the preprocessing steps, respectively.
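The proportions reported for dataset N°2 can be re-derived from the counts given in the text, as in the short Python check below (variable names are illustrative; no value is added beyond those stated above).

```python
specimens, body_parts, replicates = 169, 2, 4
total_spectra = specimens * body_parts * replicates   # spectra in the dataset
per_body_part = specimens * replicates                # spectra per body part

def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

leg_outliers, thorax_outliers = 62, 70      # outlier spectra per body part
leg_excluded, thorax_excluded = 8, 6        # specimens with all four replicates flagged

summary = {
    "total_spectra": total_spectra,
    "leg_outlier_pct": percent(leg_outliers, per_body_part),
    "thorax_outlier_pct": percent(thorax_outliers, per_body_part),
    "leg_samples_excluded_pct": percent(leg_excluded, specimens),
    "thorax_samples_excluded_pct": percent(thorax_excluded, specimens),
    "leg_averaged": specimens - leg_excluded,
    "thorax_averaged": specimens - thorax_excluded,
}
print(summary)
```

The gap between the share of outlier spectra (about 10%) and the share of excluded samples (below 5%) arises because a sample is lost only when all four of its replicates are flagged.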
To check whether the MSProfileR tool succeeded in classifying spectra according to species for each body part, averaged spectra were submitted to the peak detection steps by applying the parameters used for dataset N°1. After the filtering step, 204 and 194 peaks were retained for the classification of averaged spectra from legs (Supplementary File S7) and thoraxes (Supplementary File S8), respectively. Hierarchical clustering revealed that thorax averaged spectra were grouped per species. Only two Cx. usquatus spectra (#TH_24 and #TH_82) were not clustered with the other spectra from the same species, and the thorax averaged spectrum from the unique Cx. adamesi (#TH_2) was classified inside the Cx. dunni group (Supplementary File S8). The clustering of leg averaged spectra appeared less efficient, with several samples intertwining with other species (Supplementary File S7). The poorer classification of the leg averaged spectra compared to thorax was attributed, in part, to their lower spectra intensity. Indeed, as shown in the present study and in concordance with previous works, leg spectra were generally less intense than those of thoraxes from the same specimens [22,53].
For thoraxes, among the seven samples from the previous analysis [53] that did not reach the threshold value (LSV > 1.8) for relevant identification, five had two or more spectra replicates considered as outliers, leading to the exclusion of three samples for which all replicates exceeded the "A" score threshold. The two other thorax samples (#TH_233 and #TH_234), from Cx. dunni, conserved by the quality control step obtained nearly relevant identification scores, 1.78 and 1.74, respectively, in the previous study [53]. The clustering of these last two thorax averaged spectra (#TH_233 and #TH_234) with the other samples from the same species supports the conservation of their respective spectra by the quality control steps.
Interestingly, the previous work [53] reported that the selection of the top ten of the mass peak list per Culex species and per body part appeared sufficient to discriminate these 13 Culex species with a correct classification higher than 90%. Here, during the filtering step, all peaks with a frequency lower than 20% (i.e., 0.2) across the dataset were excluded from the analysis. In dataset N°2, for five species, one to five specimens were available, which represents from less than 1% to 4% of the number of averaged spectra of dataset N°2 (161 for legs and 163 for thoraxes). It is then possible that some peaks specific to these species were not included in the intensity matrix due to their low representation (i.e., too few specimens from the same species to reach the filtering threshold), which could explain the imperfect clustering of the samples per Culex species, notably for legs.
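The constraint imposed by the filtering threshold on poorly represented species can be made explicit with a small calculation. A peak present only in one species cannot exceed a frequency equal to the share of that species in the dataset. The function names below are illustrative.

```python
import math

def max_frequency(n_species_spectra, n_total_spectra):
    """Highest frequency a species-specific peak can reach in the dataset."""
    return n_species_spectra / n_total_spectra

def min_spectra_needed(n_total_spectra, min_frequency):
    """Smallest number of spectra that must carry a peak for it to pass the filter."""
    return math.ceil(min_frequency * n_total_spectra)

n_leg = 161  # averaged leg spectra in dataset 2
for n in (1, 5):
    print(n, round(100 * max_frequency(n, n_leg), 1), "%")
print(min_spectra_needed(n_leg, 0.2))    # spectra required at the 20% filter
print(min_spectra_needed(n_leg, 0.005))  # spectra required at a 0.5% filter
```

With 161 averaged leg spectra, a peak must be present in at least 33 spectra to pass a 20% filter, so a peak restricted to a species represented by one to five specimens is necessarily discarded.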
Lowering the peak filtering threshold from 20% to 0.5% (i.e., 0.005) led to the inclusion of 878 peaks in the intensity matrix without drastically improving the classification of leg averaged spectra. Among the peaks added in the intensity matrix, about 81.7% (n = 717), some of them are heterogeneous between averaged spectra from the same species which should perturb clustering. MSProfileR then appears well adapted for the detection of atypical spectra but also for their classification, provided that no subgroup is too under-represented, which could alter the classification. Moreover, spectra of high intensity remain essential for relevant classification.
The duration of computational analyses is another parameter to take into account. Generally, the speed of the analysis is directly linked to the power of the computer used. In the present work, the analysis was performed on a laptop with a standard configuration (Processor: Intel(R) Core(TM) i7-10510U CPU @ 1.80 GHz; RAM: 16 GiB; Graphics card: Intel Corporation UHD Graphics (rev 02); Operating System: Ubuntu 20.04). The complete pipeline process for analysing the 1352 spectra of dataset N°2 took less than four minutes with default parameters. This processing speed allows the user to compare and adjust methods and parameters throughout the workflow without a loss of time.

Discussion
MALDI-TOF MS is capable of rapidly producing large volumes of extremely rich information.It is currently used in microbiology routine diagnosis laboratories for the identification of microorganisms, including bacteria, fungi, yeasts, filaments and, more recently, for the identification of clinical tick samples collected on human hosts [55][56][57].In microbiology, microorganisms are generally cultivated using standardized procedures prior to identification by MALDI-TOF MS [2].For medical entomology studies, prior to arthropod specimen identification by MS, several factors can alter the quality of MS spectra, such as sample storing duration, storage conditions (temperature, with or without a buffer. ..) or sample preparation mode [14,58].To overcome these limitations, the establishment of a standardized protocol and the development of a reproducible data scientific workflow to treat this kind of spectra appeared compulsory.
In the last decade, in order to improve the reproducibility and increase the noise-signal ratio of arthropod intra-species MS spectra, several guidelines have been proposed for sample preparation prior to MS analysis, notably for mosquitoes [15,17], and a consensus strategy seems to have emerged [22].Some informatics tools were developed by private companies (e.g., MALDI Biotyper from Bruker, Saramis from Biomerieux) for MALDI-TOF MS spectra investigation [59,60].They are suitable for spectra analysis, based on spectral matching with a database of reference spectra, but some functionalities that seem essential, such as quality control or annotation of spectra, are missing, or, when available, the methods and parameters applied remain a black box [61].As the identification success is essentially linked to the quality of sample MS spectra, the exclusion of spectra considered as nonconform should make it possible to save time and improve the relevance of classification.Some freely available packages were created by computational specialists to import and pre-process spectra raw data [29] or to filter low-quality spectra among a dataset [31].However, as they use R language, computer knowledge is required.
In the present study, the MSProfileR tool was created to perform preprocessing including the filtering of non-compliant spectra, the classification and the annotation of MS spectra by the use of the R language and packages, notably MALDIquant and MALDIrppa [31].Thanks to the R Shiny framework, which enables user-friendly interfaces to be built for the R environment, the MSProfileR tool offers rapid analysis and ease of use.The pipeline of analysis offered by MSProfileR consists of successive modules, and the tasks of each module offer several methods.Users can select the optimum method for their process and can also adjust its parameters to obtain the finest result.The consequences of each adjustment are presented by plots, graphics, tables or other illustrations on the Shiny interface.Throughout the workflow, users can then visualize each task of the pipeline and keep control of all processes by adjusting parameters or changing methods.By recording the methods used and their parameters at the end of the analysis, they can be traced and reused.In this way, the same methods and parameters can be applied to a new dataset, reducing the experimenter's intervention time and ensuring consistency in the analyses.
Detecting aberrant spectra is an essential step in the analysis.The MSProfileR tool enables rapid and automatic detection, which is a considerable advantage for entomological studies.Until now, this detection was based on visual comparison of spectra and the application of certain criteria such as the diversity of MS profiles or the intensity of the most intense peak (>3000 ua) [59,60].This detection was highly dependent on the skills and experience of the experimenter.In the absence of standardization, the reproducibility of this detection was not guaranteed between different experimenters.The use of MSProfileR on dataset N • 1 revealed that six MS spectra were considered as non-conform.None of these had previously been detected as outliers by the experimenter.As these noncompliant spectra only concerned one repetition out of four per sample, the impact on the mean spectrum used as a reference in the DB remains negligible.However, in the future, excluding outliers before calculating the mean spectrum will improve the quality of the MS database and the relevance of the identifications.
Throughout the analysis workflow, MSProfileR offers interactive visualizations linked to the criteria (methods/parameters) chosen.This means that any changes can be quickly assessed, which was important for detecting non-compliant spectra in dataset N • 2. Adjustment of the quality control parameters revealed that a single parameter setting was not possible because the spectral profiles between the mosquito's leg and thorax differed in intensity and diversity.The dataset was therefore split beforehand according to body part, allowing for good detection of non-compliant spectra.This dataset highlighted the need to homogenize the origin of the samples in order to improve the quality control stage of the spectra.MSProfileR detected replicate MS spectra of the legs of two Culex samples as outliers, confirming the classification made in previous work [48].In addition, new MS spectra were detected as outliers (n = 132 out of 1352).In the previous study [48], the selection was essentially based on the intensity of the most intense peak and the skill of the experimenter.In the previous study, this manual selection was complex and very time-consuming because this second dataset included a total of 1352 spectra.Conversely, the user can now inspect the spectra classified as outliers by MSProfileR and validate or not validate this classification.Detecting new outlier spectra and validating them underlines the tool's efficiency and saves time.By controlling the quality of spectra in a reproducible way and independently of the user, the analysis of spectra and the classification of species will improve.
A decisive parameter remains the peak detection threshold [62], which is linked to the selection of the optimal SNR value.A too low SNR value could lead to the inclusion of peaks corresponding to background noise and inversely a too high SNR value could induce the miss-detection of true peaks, which could alter spectra classification in the next steps.The selection of the SNR value is rarely possible with the majority of commercial bio-informatics tools, and when it is possible, the consequence of SNR value change on peak detection is not easy to judge.Often its choice is highly subjective.MSProfileR now makes it possible to directly see the effect of changing the SNR value on the detected peaks.In a recent study assessing COVID-19 diagnosis using human saliva MS spectra obtained by MALDI-TOF profiling, we reported the interest and importance of selecting the optimal SNR value [63].
An important original feature of MSProfileR is the ability to annotate all the spectra that are averaged after the validation stages. To our knowledge, no commercial or open-source software allows information to be added to each spectrum. MSProfileR makes it possible to associate and store information about the sample analysed (origin, sample type, preservation, preparation protocol, etc.), which is essential for carrying out statistical analyses and for storing these spectra as references in a database.
Finally, the correct classification of spectra by hierarchical clustering for both the heterogeneous dataset N°1 and the highly homogeneous dataset N°2 underlined the wide usefulness of this tool for sample comparisons. As the spectra from dataset N°2 originated from a field collection of arthropods, without prior knowledge of the species encountered, only unsupervised classification is possible. Hierarchical clustering is therefore an appropriate strategy for arthropod classification. For mosquitoes, two body parts, the legs and the thorax, can be submitted independently to improve specimen identification [20,53]. Here, comparing the dendrograms from paired samples can be informative for verifying the concordance of classification. Although the ordering of paired samples differs between dendrograms according to the body part tested, the same list of samples clusters on one branch for each body part.
To maintain the extensibility of the application, the HDF5 file panel was created to store the averaged spectra of each dataset, their annotations, the parameters used and their intensity matrix. This module will be helpful for the future construction of a reference MS spectra database, which will be used for specimen identification by spectral matching [64,65].
Recently, a web tool called GeenaR has been published [66]. Like MSProfileR, GeenaR offers a complete workflow for the analysis of MALDI-TOF MS spectra with comparable functionalities. The main difference is that, in GeenaR, all methods and parameters are defined on a single web page before the analysis is run. GeenaR has the advantage of being very easy to use and also includes quality control of MS spectra. Unfortunately, every time the user changes a method or a parameter, the whole application has to be re-run. The high heterogeneity of arthropod MS spectra, which depends on numerous factors (e.g., family, body part, storage mode, storage duration, sample preparation conditions, etc.), requires an adjustment of software parameters to optimize spectra analysis [17,67]. The possibility of rapidly visualizing each modification of a method or parameter setting is essential for entomological spectra analyses, notably for detecting outliers and determining the cut-off value, which can vary according to the body part tested for paired species samples, as observed in dataset N°2. MSProfileR makes it possible to see the consequence of each parameter change as soon as it is made. MSProfileR is therefore well suited to the analysis of MS spectra datasets with a high diversity of protein profiles, as occurs among arthropod species from distinct families (e.g., dataset N°1) but also among species from the same family (e.g., dataset N°2).

Conclusions
In developing the MSProfileR tool, we created a MALDI-TOF MS scientific workflow with functionalities adapted to the investigation of arthropod vector populations. Its main functionalities are the numerous quality control tests on the spectra, the classification of averaged spectra and their annotation. The advantage of MSProfileR resides in its semi-automatic processing, which allows the user to intervene during the analysis to improve the quality of the results. Throughout the pipeline, the user can visualize and control all the tasks and quickly adjust each parameter. This easy-to-use Shiny graphical app runs in a web browser and is accessible on Windows, macOS and Linux through the MSProfileR R package, which is available on GitHub. MSProfileR is therefore a user-friendly tool that analyses spectral data without the need for programming expertise. MSProfileR appears to be a promising tool for analysing MS spectra from arthropods of public health importance, such as mosquito and tick vectors. It is open-source software that can be used by the scientific community, particularly entomologists.

Figure 2. Tree overview of the application architecture. The MSProfileR tool was created in the R language with a Shiny interface. The scripts were divided into three parts: the user interface (UI), the server and the reporting scripts.


Figure 3. Assessment of the quality control step on dataset N°2. Graphical representation of the atypical score of all spectra (A), of leg spectra (B) and of thorax spectra (C) from dataset N°2. Typical and outlier spectra are indicated by blue and red numbers, respectively. The criteria and results of spectra classification are indicated (D). Representative spectra classified as outliers and typical for the whole dataset (E), for leg samples (F) or for thorax samples (G). The respective A scores are indicated in brackets on each spectrum.


Table 1. Overview of the organization of each tab of the MSProfileR tool user interface.
* Method used by default in the MSProfileR tool. (a) A human-readable file format stores previously entered parameters by default. (b) The HDF5 file is a special file with a database function to store all the raw data introduced during the analysis. ESD, extreme studentized deviation; HDF5, Hierarchical Data Format version 5; kDa, kilodalton; MAD, median absolute deviation; PQN, Probabilistic Quotient Normalization; RC, Rousseeuw and Croux; SNIP, Sensitive Nonlinear Iterative Peak; SNR, signal-to-noise ratio; min, minimum; Sqrt, square root method; TIC, Total-Ion-Current.

Table 2. Overview of arthropod species selected from "dataset N°1" for MS spectra analysis, including the reference list for breeding or sample preparation.
For the larval stage, the whole specimen was submitted to MS analysis. * Number of specimens submitted to MS analysis. # As each specimen was loaded in quadruplicate on the MS plate, the total number is 192 MS spectra. Ae., Aedes; Am., Amblyomma; An., Anopheles; Ct., Ctenocephalides; Rh., Rhipicephalus.

Table 3. Overview of Culex mosquito species selected to compose "dataset N°2" of MS spectra *.
* Dataset obtained from Costa et al. [53]. § Number of specimens used to create the reference MS database per body part according to Costa et al. [53]. $ Number of specimens submitted to MS analysis. # As two body parts (legs and thoraxes) from each specimen were loaded in quadruplicate on the MS plate, the total number is 1352 MS spectra. Cx., Culex; Cux., Culex; Mel., Melanoconion.