Proteogenomics Analysis Reveals Novel Micropeptides in Primary Human Immune Cells

Short open reading frames (sORFs) encoding functional peptides have emerged as important mediators of biological processes. Recent studies indicate that sORFs of long non-coding RNAs (lncRNAs) can encode functional micropeptides that regulate immunity and inflammation. However, large-scale identification of potential micropeptide-encoding sequences remains a significant challenge. We present a data analysis pipeline in which mass spectrometry-based proteomic data derived from immune cells are reanalyzed using a rigorous proteogenomics-based workflow. Our analysis identified 2815 putative lncRNA-encoded micropeptides across three human immune cell types. Applying a stringent score cut-off and manual verification yielded 185 high-confidence putative micropeptide-coding events, the majority of which have not been reported previously. Functional validation revealed the expression of the lnc-MKKS micropeptide and its localization to both nuclear and cytoplasmic compartments. Our pilot analysis serves as a resource for future studies focusing on the role of micropeptides in immune cell response.


Introduction
Long non-coding RNAs (lncRNAs) are a heterogeneous group of transcripts that are over 200 nucleotides in length and are not translated [1]. lncRNAs have been ascribed functions through their control of proximal and distant gene expression and their regulation of splicing, turnover, translation, and signaling pathways [1,2]. lncRNAs play wide-ranging roles in regulating immunity and are involved in cell development, immune response, and host-pathogen interactions [3]. Although lncRNAs are non-coding by strict definition, there is growing evidence that some genuine lncRNAs encode functionally important micropeptides from short open reading frames (sORFs), while some micropeptide-encoding genes are misannotated as lncRNAs [4,5]. Several studies have reported lncRNA-encoded micropeptides that function in adaptive and innate immune responses. For example, a 17-amino acid micropeptide encoded by lncRNA MIR155HG was recently observed to be highly expressed by antigen-presenting cells and was found to modulate major histocompatibility complex (MHC) class II-mediated antigen presentation and T-cell priming [5]. Further, a micropeptide encoded by lncRNA 1810058I24Rik was reported as essential for activating the Nlrp3 inflammasome in macrophages [6]. In addition, a micropeptide translated from the noncanonical start site of lncRNA Aw112010 was found to be important in immunity against Salmonella infection in mice [7]. These studies underscore the importance of functionally relevant micropeptides encoded by annotated lncRNAs that remained undetected using other methods. While several approaches exist to mine OMICS data for micropeptides [8], proteogenomics methods that search mass spectrometry data against three-frame translations of lncRNAs hold promise for identifying bona fide micropeptide candidates expressed from both canonical and non-canonical start sites.
Proteogenomics is one of the emerging areas of 'OMICS' technologies and involves the correlation of proteomic data with genomic and transcriptomic data for a better understanding of the genome [9]. Proteogenomics can also be carried out by analyzing mass spectrometry data using six-frame translated genomic or three-frame translated transcriptomic databases [10]. While proteogenomics methods have been widely used to refine genome assemblies and explore molecular mechanisms of diseases such as cancers [10], they have been relatively less explored in the context of innate immunity and inflammation. Given the functional importance of lncRNA-encoded micropeptides in regulating immune responses, in-depth analysis will likely provide the means to explore their potential applications as novel therapeutics for controlling immune responses in diseases. To this end, we provide a proteogenomics pipeline framework that integrates data obtained from several tools to identify potential lncRNA-encoded micropeptide candidates from immune cells.

Results
To identify potential lncRNA-encoded micropeptide candidates in immune cells, we designed a computational pipeline combining mass spectrometry and transcriptome data (Figure 1A). Briefly, the lncRNA database obtained from LNCipedia was used to generate a three-frame translated lncRNA database, against which mass spectrometry-based proteomics data on immune cells obtained from the PRIDE repository were searched. The datasets came from two studies of primary human monocytes and the secretomes of dendritic cells, monocytes, and macrophages (Supplementary Table S1). Overall, our analysis led to the identification of 2815 nonredundant putative micropeptides (Table 1) across three immune cell types. Since sORFs encode short peptides, a large majority of the lncRNA-encoded sORFs were identified on the basis of single-peptide evidence.
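The three-frame translation step described above can be illustrated with a minimal Python sketch. It is not the EMBOSS transeq implementation used in this study; the function names and the FASTA header format (`_frame1` to `_frame3`) are assumptions made for illustration. Stop codons are written as `*` and codons containing ambiguous bases as `X`.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(seq, frame):
    """Translate one forward frame (offset 0, 1 or 2) of a transcript.

    RNA input is accepted (U is read as T); codons with ambiguous
    bases are reported as 'X' and stop codons as '*'.
    """
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translate(seq):
    """Return {1: ..., 2: ..., 3: ...} with the three forward-frame translations."""
    return {frame + 1: translate_frame(seq, frame) for frame in range(3)}


def to_fasta(name, seq):
    """Format the three translations as FASTA records (hypothetical header style)."""
    return "".join(
        f">{name}_frame{frame}\n{protein}\n"
        for frame, protein in three_frame_translate(seq).items()
    )


if __name__ == "__main__":
    # Toy transcript, not a real lncRNA sequence.
    print(to_fasta("toy_transcript", "ATGGCCTAA"))
```

Only the forward strand is translated because a transcript database already fixes the strand; a genomic search would need all six frames.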
We next compared the list of lncRNA-encoded peptides identified in the current study with the proteogenomics results published as part of the human proteome draft map study [14] and found an overlap of five peptides. The details of these are provided in Table 2. The relatively low overlap between these studies may be because the draft of the human proteome used an older-generation accurate-mass mass spectrometer (Orbitrap Velos), whereas the data we used (Rieckmann et al.) were acquired on a newer-generation accurate-mass mass spectrometer (Q Exactive HF), which resulted in a deeper and higher-resolution proteomic profile. Further, the human proteome data contained only monocytes, while the data of Rieckmann et al. contained 28 primary human hematopoietic cell populations. Among these, we identified the peptide NDDIPEQDSLGLSNLQK, encoded by lnc-MKKS, in monocytes. This was previously proposed as a novel coding region by the human draft proteome [14] and recently also reported by Nomura and Dohmae et al. [15]. The current study found it to be a potential micropeptide encoded in the UTR region of the MKKS gene (Figure 1B,C). It is now being considered as an exon that encodes an alternate translated variant of MKKS (NP_001381077.1). Domain analysis found a transmembrane helix region in the sequence. Secretome analysis of the lnc-MKKS micropeptide using SecretomeP 2.0 predicted the presence of a signal peptide as well as high odds of non-classical secretion (NN-score of 0.815, odds of 3.701). These findings suggest that the micropeptide may participate in functions such as enzyme catalysis, transport across membranes, receptor signal transduction, or energy transfer. The lnc-MKKS-encoded peptide was found to be highly conserved in mammals (Figure 1D).
The rest of the candidates did not show any discernible signal, which may be influenced by several factors. We next sought to investigate, for the first time, the subcellular localization of the lnc-MKKS micropeptide to understand its role in cells. To achieve this, cells transfected with lnc-MKKS-Flag were subjected to confocal microscopy analysis, which indicated that the lnc-MKKS micropeptide was localized to both the nucleus and cytosol of 293T cells under basal conditions (Figure 2C,D, Supplementary Videos S1 and S2).

Discussion
lncRNAs have been ascribed various roles in biological processes and have been studied in the context of several diseases. Increasing evidence now suggests the coding potential of many lncRNAs, mainly through the use of ribosome profiling and nascent chain sequencing studies [31,32]. Proteogenomics approaches can also be used to provide evidence for the coding potential of lncRNAs and have been primarily applied in the context of cancer cells [4,11,33]. Considering that several functional micropeptides have been identified in immune cells [5,6], the use of proteogenomics could lead to the discovery of putative functional micropeptide candidates and inform therapy for immunological diseases. Several studies have implicated micropeptides in immunity and inflammation. For example, a study by Niu and colleagues showed that a micropeptide encoded by lncRNA MIR155HG can suppress autoimmune inflammation by modulating antigen presentation [5]. Further, Bhatta and colleagues identified a 47-amino acid mitochondrial micropeptide that could activate the Nlrp3 inflammasome [6]. These studies highlight the role lncRNA-encoded micropeptides could play in controlling innate immune responses through the identification of novel signaling mechanisms. The current study sought to build a proteogenomics pipeline to identify potential micropeptides in immune cells in a high-throughput manner.
This study demonstrates that mining immune cell proteomics data using a proteogenomic approach can discover putative micropeptides encoded by long non-coding RNAs. Although the current study identified over 2800 potential micropeptide events, these cannot be investigated further due to the lack of literature regarding these micropeptides. Only one of the candidates, lnc-MKKS, was previously identified in the draft map of the human proteome [14] as well as in 16 other publicly available PRIDE (PXD) proteomics datasets using the Global Proteome Machine database (GPMDB, gpmdb.thegpm.org), validating our pipeline. This necessitates further studies on the high-throughput identification of micropeptides in immune cells.
We tested whether some of these micropeptides could be expressed in an in vitro model, HEK293T cells. Out of seven candidates, lnc-MKKS was expressed at high levels upon ectopic expression, while lnc-RMST was expressed at low levels. The relatively low number of positive validations may be explained by probable immune cell specificities of these candidates. The lnc-MKKS micropeptide was found to localize to both the nucleus and cytoplasm in HEK293T cells.
The current study has a few limitations. The absence of literature on high-throughput experiment-based identification of micropeptides is a major challenge. In addition, the functions of most of these micropeptides are yet to be explored. While the current study transiently expressed epitope-tagged micropeptides, the study of endogenous micropeptides in cells using specific antibodies may yield more information pertaining to their functions. Their expression could also depend on specific transcription factors and cofactors, thereby leading to cell-type-specific expression. In addition, interactome studies on these micropeptides could provide valuable information on their signaling roles and functions. The current study used a false discovery rate of 1% at the peptide level, which is routinely used in mass spectrometry-based proteomic analysis; even so, there is a significant probability that many of these findings are false positives. We have tried to limit spurious findings by checking the quality of the spectra for all the candidate peptides. However, the benefits of using the pipeline outweigh the issues of false positivity, as the identification of even a few bona fide micropeptides could significantly advance knowledge of the regulation of innate immune/inflammatory responses, which could, in turn, have major implications for the treatment of immune disorders.
In conclusion, the current study describes a proteogenomic pipeline to detect several putative micropeptides with experimental validation of one of the hits. The absence of literature on the majority of these events constitutes a significant bottleneck towards identifying their function in biological processes, limiting the current functional studies to an individual hit at a time. This work adds a new method for discovering putative micropeptides, paving the way for studies focused on the high-throughput identification and validation of functional micropeptides.

Bioinformatics Analysis
The human lncRNA database was fetched in gtf format from LNCipedia 5.2 (https://lncipedia.org/, accessed on 4 October 2018) [34] and was parsed to generate a FASTA file using Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/file_formats/). The lncRNA sequences were translated in three frames using EMBOSS (http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/transeq.html). Proteomic data pertaining to human monocytes and the secretomes of dendritic cells, monocytes, and macrophages were obtained from publicly deposited data available in PRIDE from studies [35,36] (Supplementary Table S1). The proteomics data were searched against the three-frame translated database using Proteome Discoverer version 2.2 (Thermo Scientific, Bremen, Germany), using the Sequest HT and MS Amanda 2.0 search algorithms. The search parameters included trypsin as the proteolytic enzyme with a maximum of one missed cleavage. Carbamidomethylation of cysteine was specified as a fixed modification, and the oxidation of methionine and protein N-terminal acetylation were specified as dynamic modifications. The minimum peptide length considered was seven amino acids. A precursor mass tolerance of 10 ppm and a fragment mass tolerance of 0.05 Da were used for the search. A decoy database search was used to apply a 1% FDR at the peptide level. Peptides with >1 PSMs and an XCorr score of ≥2 were considered for further analysis. The XCorr score of the SEQUEST algorithm represents the cross-correlation between experimental and theoretical spectra and is used to judge whether a peptide has been positively identified. An XCorr of >2 is usually indicative of a good correlation. It has been reported earlier that, to reach a false-positive rate of 1.0% in the case of fully tryptic peptides, XCorr should be larger than 2.0 for a charge state of +1 [37]. Protein BLAST was carried out, and peptides matching known proteins (entries existing in the RefSeq protein database) were removed from the list.
The peptides were then categorized based on their positions near known genes using the BLAT UCSC browser (https://genome.ucsc.edu/cgi-bin/hgBlat). The peptides were checked for sequence conservation in other mammals and for EST (expressed sequence tag) evidence. Bona fide micropeptides were listed, and manual validation of MS/MS spectra for these sequences was carried out. To assess conservation, lnc-MKKS-4 orthologues were aligned in Ensembl, and the alignment was plotted using Interactive Tree of Life (https://itol.embl.de). Domain analysis of the putative lnc-MKKS-4 micropeptide was carried out using SMART (http://smart.embl-heidelberg.de/). Whether the micropeptide is secreted was assessed using the SecretomeP 2.0a Server (http://www.cbs.dtu.dk/services/SecretomeP/).
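The post-search filtering described above (more than one PSM, XCorr of at least 2, and removal of peptides matching known proteins) can be sketched in Python as follows. This is a minimal illustration rather than the Proteome Discoverer workflow used in the study: the `PeptideHit` record and `filter_candidates` function are hypothetical, the exact-match lookup only stands in for the protein BLAST step, and the PSM counts and scores in the example are invented for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideHit:
    """One peptide-level search result (hypothetical record layout)."""
    sequence: str
    psm_count: int     # number of peptide-spectrum matches
    best_xcorr: float  # best SEQUEST XCorr among those PSMs


def filter_candidates(hits, known_peptides, min_psms=2, min_xcorr=2.0):
    """Keep peptides with >1 PSM and XCorr >= 2 that match no known protein.

    `known_peptides` is an assumption: a set of sequences already found in
    a reference proteome, used here in place of a BLAST search.
    """
    known = set(known_peptides)
    return [
        hit for hit in hits
        if hit.psm_count >= min_psms
        and hit.best_xcorr >= min_xcorr
        and hit.sequence not in known
    ]


# Illustrative values only; these are not results from the study.
hits = [
    PeptideHit("NDDIPEQDSLGLSNLQK", psm_count=3, best_xcorr=3.1),
    PeptideHit("AAAAAAAK", psm_count=1, best_xcorr=4.0),  # single PSM
    PeptideHit("LLLLLLLK", psm_count=5, best_xcorr=1.5),  # low XCorr
    PeptideHit("GGGGGGGK", psm_count=4, best_xcorr=2.5),  # known protein
]
kept = filter_candidates(hits, known_peptides={"GGGGGGGK"})

if __name__ == "__main__":
    for hit in kept:
        print(hit.sequence, hit.psm_count, hit.best_xcorr)
```

The thresholds are exposed as parameters so that stricter cut-offs can be tried without changing the function.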

ORF Cloning and Confocal Microscopy
Gene fragments for selected putative micropeptide-encoding ORFs with a C-terminal FLAG-tag were synthesized by Integrated DNA Technologies (Coralville, IA, USA). The PCR primers listed in Supplementary Table S2 were used to amplify these ORFs with the addition of BglII and XhoI restriction sites on the 5′- and 3′-ends, respectively, to insert these ORFs into the PMSCV PIG expression vector (Addgene; #21654). Either an empty vector or a micropeptide-ORF inserted vector was transiently transfected into HEK 293T cells using GeneJuice (Millipore Sigma, Burlington, MA, USA; #70967). At 72 h post-transfection, micropeptide expression was assessed by Western blot using antibodies against Flag (Sigma, Saint Louis, MO, USA; #A8592) and β-actin (Cell Signaling Technology, Danvers, MA, USA; #5125S). In addition, confocal microscopy was carried out to analyze the expression and localization of lnc-MKKS. Cells were fixed on eight-well chambered slides (155411; Lab-Tek) using ice-cold 4% paraformaldehyde for 15 min and were washed with 1X PBS. The cells were permeabilized using 0.2% Triton X-100 in 1X PBS for 15 min and blocked using 5% Normal Goat Serum (005-000-121; Jackson ImmunoResearch Laboratories, West Grove, PA, USA) in 1X PBS and 0.2% Triton X-100. Cells were incubated with antibodies against Flag conjugated to Alexa Fluor 488 (Cell Signaling Technology, MA, USA; #5407), against tubulin conjugated to Alexa Fluor 647 (Cell Signaling Technology, MA, USA; #5046), and against GAPDH conjugated to Alexa Fluor 647 (Cell Signaling Technology, MA, USA; #5046), and with the DNA dye DAPI (Cell Signaling Technology, MA, USA; #4083). Imaging was carried out using a Leica SP8 Lightning confocal microscope.
Supplementary Table S1. Details of publicly available proteomic datasets used in the current study.
Supplementary Table S2. List of PCR primers for putative micropeptide-encoding ORFs.
Supplementary Table S3. List of potential micropeptides encoded by 'lncRNAs' and not matching known proteins.
Supplementary Data S1. Spectra for putative micropeptide-encoding lncRNAs.
Author Contributions: Y.S. carried out the data curation, data analysis, visualization, writing of the manuscript draft, and prepared figures. A.B. carried out experiments including cloning, Western blot, confocal microscopy. S.M.P. carried out data analysis and writing of the manuscript draft. K.A.F. and R.K.K. conceptualized the study, acquired funds, and reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This work was funded by the Research Council of Norway (FRIMEDBIO "Young Research Talent" Grant 263168 to R.K.K.; and Centres of Excellence Funding Scheme Project 223255/F50 to CEMIR), Onsager fellowship from NTNU (to R.K.K.); NIH grants T32 AI095213 (to A.B.); AI147208 and AI067497 (to K.A.F.).

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data supporting the results can be found in Supplementary Materials.