A Multiple Protease Strategy to Optimise the Shotgun Proteomics of Mature Medicinal Cannabis Buds

Earlier this year we published a method article aimed at optimising protein extraction from mature buds of medicinal cannabis for trypsin-based shotgun proteomics (Vincent, D., et al. Molecules 2019, 24, 659). We then developed a top-down proteomics (TDP) method (Vincent, D., et al. Proteomes 2019, 7, 33). This follow-up study aims at optimising the digestion of medicinal cannabis proteins for identification purposes by bottom-up and middle-down proteomics (BUP and MDP). Four proteases, namely a mixture of trypsin/LysC, GluC, and chymotrypsin, which target different amino acids (AAs) and therefore are orthogonal and cleave proteins more or less frequently, were tested both on their own as well as sequentially or pooled, followed by nLC-MS/MS analyses of the peptide digests. Bovine serum albumin (BSA, 66 kDa) was used as a control of digestion efficiency. With this multiple protease strategy, BSA was reproducibly 97% sequenced, with peptides ranging from 0.7 to 6.4 kD containing 5 to 54 AA residues with 0 to 6 miscleavages. The proteome of mature apical buds from medicinal cannabis was explored more in depth with the identification of 27,123 peptides matching 494 unique accessions corresponding to 229 unique proteins from Cannabis sativa and close relatives, including 130 (57%) additional annotations when the list is compared to that of our previous BUP study (Vincent, D., et al. Molecules 2019, 24, 659). Almost half of the medicinal cannabis proteins were identified with 100% sequence coverage, with peptides composed of 7 to 91 AA residues with up to 9 miscleavages and ranging from 0.6 to 10 kDa, thus falling into the MDP domain. Many post-translational modifications (PTMs) were identified, such as oxidation, phosphorylations, and N-terminus acetylations. This method will pave the way for deeper proteome exploration of the reproductive organs of medicinal cannabis, and therefore for molecular phenotyping within breeding programs.


Introduction
The recent revised legislation on medicinal cannabis has triggered a surge of medical and clinical research studies evaluating the effect of major cannabis components on human health. C. sativa has been dubbed "the plant of the thousand and one molecules" [1] owing to its propensity to produce a plethora of phytochemicals with myriad of biological activities as well as fibrous components. Out of the 500 compounds that have been described thus far [2][3][4][5], more than 90 are phytocannabinoids, including cannabidiolic acid (CBDA) [6] and delta 9-tetrahydrocannabinolicacid (THCA) [7]. The biosynthetic pathway of C. sativa phytocannabinoids and the characterization of related enzymes was recently elucidated [1]. The main enzymes are 3,5,7-trioxododecanoyl-CoA synthase (OLS, a polyketide synthase) and olivetolic acid cyclase (OAC) acting in succession to convert hexanoyl-CoA into olivetolic Fulfilling BUP and MDP potentials through the use of multiple proteases of various specificity ensures both a higher coverage of AA sequences and deeper proteome exploration, which is critical to discriminate closely related protein isoforms and detect various PTM sites, as well as robust and precise protein identification and quantification. Based on our previous BUP experience in which protein recovery from mature cannabis buds was optimised [11], the present study aims at improving our shotgun proteomics workflow, in particular the digestion steps, to identify more proteins with greater confidence and discover more PTMs. To this end, we developed a combinatorial multiple protease method on mature buds from medicinal cannabis. We chose three orthogonal digestions, namely chymotrypsin (of low specificity targeting hydrophobic residues Y, F, W, and to a lesser extent L), a ready-to-use mixture of trypsin/LysC (of medium specificity targeting positively charged residues R and K), and GluC (high specificity targeting negatively charged residues mostly E and occasionally D under our experimental conditions). These enzymes were carefully selected based on their specificity to yield peptides spanning from 0.5 to 10 kDa as to cover both BUP and MDP ranges. We first tested the digestion efficiency of the proteases on their own or in sequence using the shotgun reference protein BSA, and then applied the method to complex plant samples. Single-, double-and triple-enzyme digests were analysed by nLC-MS/MS. The results are discussed in terms of reproducibility, number of identified peptides, missed cleavages, PTM detection, AA sequence and proteome coverages.

Results and Discussion
In a previous study, we applied the gold standard protease in proteomics, trypsin, to digest cannabis proteins and analysed them using a BUP approach [11]. The proteome of mature apical buds was satisfactorily covered as assessed by the identification of all the enzymes involved in the biosynthesis of phytocannabinoids, along with many other enzymes from cannabis secondary metabolism (e.g., isoprenoids, terpenoid, and phenylpropanoid). However, AA sequence coverage never reached 100% and only a handful of PTMs were identified. More proteins identified based on more proteotypic peptides to ensure deeper proteome coverage can be reached by adopting a multiple protease strategy as demonstrated by a wide body of evidence [29,38,[44][45][46][47][48][50][51][52][53][54][55]. In this experiment, a trypsin/LysC (T) mixture, GluC (G) and chymotrypsin (C) were applied on their own or in combination, either subsequentially (denoted by an arrow "->") in a serial digestion fashion, or by pooling individual digests together (denoted by a colon ":"). The analytical method was first tested on BSA and then applied to complex plant samples. The experimental design is schematised in Figure 1.

Shotgun Proteomics on the Test Protein BSA
BSA is used as a positive control in our experiment as it is the gold standard used for shotgun proteomics [56]. It is a 66 kDa monomeric protein particularly amenable to trypsin digestion. The sequence coverage of BSA tryptic digest is often used to evaluate instrument performance as it is

Shotgun Proteomics on the Test Protein BSA
BSA is used as a positive control in our experiment as it is the gold standard used for shotgun proteomics [56]. It is a 66 kDa monomeric protein particularly amenable to trypsin digestion. The sequence coverage of BSA tryptic digest is often used to evaluate instrument performance as it is sensitive to both MS and MS/MS method settings. We routinely use BSA as a quality control of both enzymatic digestion and nLC-MS/MS analyses [11,24].

LC-MS/MS Data from BSA Digests Are Very Reproducible
Each BSA digests underwent nLC-MS/MS analyses in which each duty cycle comprised a full MS scan was followed by CID MS/MS events of the 20 most abundant parent ions above a 10,000 counts threshold. Supplementary Figure S1 displays the nLC-MS profiles corresponding to one replicate of each BSA digest. The peptides elute from 9 to 39 min corresponding to 9-39% ACN gradient, respectively and span m/z values from 300 to 1600. Visually, LC-MS patterns from samples subject to digestion with GluC followed by chymotrypsin (G->C) are relatively less complex than the other digests.
Technical duplicates of the BSA digests yield MS and MS/MS spectra of high reproducibility as can be seen in Table 1.   ,554  1884  2  9708  8885  9297  796  10  1618  1496  1557  244  17  2  BSA  SD  5707  6752  5811  1218  1  1317  1648  1333  756  1  544  531  501  136  4  1  min  77,085  79,616  79,570  440  1  6450  5163  5807  92  7  929  475  789  64  11  1  max  99,001 100,001 99,501  3794  5  11,311  10,115  10,505  2647  12  2428  2332  2237  444  24  3 All LC-MS patterns are highly complex with a multitude of ions. The number of MS peaks vary from 77,085 (G->C rep 1) to 100,001 (G:C rep 2) across all patterns and SDs range from 440 (T) to 3794 (T->C) with coefficient of variations (%CVs) always lower than 5%, even though a full set of eleven digest combinations (Figure 1) was run first (technical replicate 1), and then fully repeated in the same order (technical repiclate 2) with no randomization applied. The number of MS/MS events ranges from 5163 (6%, G->C rep 2) to 11,311 (13% T->G rep 1), which amounts to 10% of all the MS peaks on average ( Table 1). The number of MS/MS events per sample is determined by the duration of the run (50 min) and the duty cycle (3 s) which in turn is controlled by the resolution (60,000), number of microscans (2) and number of MS/MS events per cycle (20). In our experiment, a 50 min run allows for 1,000 cycles and 20,000 MS/MS events. Proteotypic peptides elute for 30 min, thus allowing for a maximum of 12,000 MS/MS scans. With an average number of 9297 MS/MS spectra obtained ( Table 1), 77% of the potential is thus achieved. Duty cycles can be shortened by lowering the resolving power of the instrument, minimizing the number of microscans and diminishing the number of MS/MS events. The MS/MS data were searched against a database containing the BSA sequence using SEQUEST algorithm for protein identification purpose. Of all the MS/MS spectra generated in this study, between 475 (9%, G->C rep 2) and 2428 (24%, T:C rep 1) are successfully annotated as BSA peptides (Table 1). On average, 17% of the MS/MS spectra yield positive database hits, which amounts to an average of 1.8% of MS peaks. The list of BSA peptides identified in this study is available in Supplementary Table S1. Trypsin/LysC yields 68 unique BSA peptides, GluC yields 79 unique BSA peptides, and chymotrypsin yields 104 unique BSA peptides. These values are greater than those reported by Giansanti and colleagues, with 37 trypic peptides, 31 GluC-associated peptides and 24 chymotryptic peptides [53], which explains our higher AA sequence coverage (discussed below). In a previous BUP study, BSA was identified with 40 unique peptides obtained using tryspin on its own [24], therefore the mixture trypsin/LysC enhances the digestion of BSA. The percentages of Table 1 are turned into a histogram in Supplementary Figure  S2 to better visualize the proportions of meaningful data across proteases. The proportion of MS peaks fragmented by MS/MS remains constant across BSA digests, oscillating around 10 ± 3% (light grey bars). The proportions of MS/MS spectra annotated in SEQUEST (i.e., successful hits) however show more variation across proteases (black bars). Higher percentages are reached when trypsin/LysC is employed on its own or in combination with GluC and/or chymotrypsin (Supplementary Figure S2). This is expected as BSA is notoriously amenable to trypsin digestion and often used as shotgun proteomics standard [56]. BSA (P02769) mature primary sequence contains 583 AAs, from position 25 to 607; the signal peptide (position 1 to 18) and propeptide (position 19 to 24) are excised during processing. In theory, BSA should favorably respond to each protease as it contains plethora of the AAs targeted during the digestion step. Supplementary Figure S3A indicates the AA composition of BSA. Targets of chymotrypsin (L, F, Y, and W) account for 19% of BSA sequence, targets of GluC (E and D) represent 17% of the sequence, and targets of trypsin/LysC (K, R) make 14% of the total AA composition of BSA. As these percentages are comparable, the difference in the numbers of MS/MS spectra successfully matched by SEQUEST from one protease to another cannot be attributed to digestion site predominance. When we compare these predicted percentages to those observed in our study based on unique peptides (Supplementary Figure S3B), we can see that all the targeted AAs indeed undergo cleavage. The predicted rate always exceeds the observed one, but only moderately for W, Y, E, K, and R residues (less than 1.5% difference). However, F, L, and in particular D residues present an observed cleavage rate that is much lower than the predicted one (Supplementary Figure S3B). The higher specificity of GluC towards E is explained by the use of ammonium bicarbonate as diluent. The lower cleavage efficiency towards L and D, among the most abundant AAs in BSA (Supplementary Figure  S3A), is likely to impact sequencing efficiency.
If we report the number of successfully annotated MS/MS events to that of MS peaks, percentages fluctuated from 1.0% (G->C) to 2.6% (T:C) ( Table 1 and dark grey bars in Supplementary Figure S2). This suggests that some of the digestion conditions devised in this work are better suited to sequence BSA than others. This also highlights that the vast majority of the ions generated by nLC-MS remain anonymous because either they are not selected as precursors or despite their being fragmented, the MS/MS output is not recognized in the searched database. Some of this is explained by the fact that the BSA standard used in their study is 98% pure and contains other proteins [23,24] that are ignored during the search stage on purpose. Maximizing the proportion of ions chosen for MS/MS fragmentation should improve database hit rates. fragmented, the MS/MS output is not recognized in the searched database. Some of this is explained by the fact that the BSA standard used in their study is 98% pure and contains other proteins [23,24] that are ignored during the search stage on purpose. Maximizing the proportion of ions chosen for MS/MS fragmentation should improve database hit rates.  Principal component analysis (PCA) shows that technical duplicates group together (Figure 2A). BSA samples arising from enzymatic digestion using chymotrypsin (C) in combination or not with GluC (G->C and G:C) separate from the rest, particularly tryptic digests (T), along PC 2 explaining 17.5% of the variance. Hierarchical clustering analysis (HCA) confirms PCA results and further indicates that samples treated with trypsin/LysC (T) and GluC (G) on their own or pooled (T:C) form one cluster (cluster 4, Figure 2B). The closest cluster (cluster 3) comprises all the samples subject to sequential digestions (represented by an arrow ->), except for digests resulting from the consecutive actions of GluC and chymotrypsin (G->C) which constitute a cluster on their own (cluster 1). The last cluster (cluster 2) groups chymotryptic samples with the remaining pooled digests (represented by a Principal component analysis (PCA) shows that technical duplicates group together (Figure 2A). BSA samples arising from enzymatic digestion using chymotrypsin (C) in combination or not with GluC (G->C and G:C) separate from the rest, particularly tryptic digests (T), along PC 2 explaining 17.5% of the variance. Hierarchical clustering analysis (HCA) confirms PCA results and further indicates that samples treated with trypsin/LysC (T) and GluC (G) on their own or pooled (T:C) form one cluster (cluster 4, Figure 2B). The closest cluster (cluster 3) comprises all the samples subject to sequential digestions (represented by an arrow ->), except for digests resulting from the consecutive actions of GluC and chymotrypsin (G->C) which constitute a cluster on their own (cluster 1). The last cluster (cluster 2) groups chymotryptic samples with the remaining pooled digests (represented by a colon). The fact that clusters 1-3 contains samples treated with chymotrypsin (except for T->G) suggests that this protease produces peptides with unique properties which impact the down-stream analytical process. This confirms that chymotrypsin indeed acts in an orthogonal fashion to trypsin as previously noted [28].
Based on the 589 unique BSA peptides identified in this study (Supplementary Table S1), we generated a BSA sequence alignment map ( Figure 2C) and coverage histogram ( Figure 2D). All digests considered, BSA sequence is at least 70% covered (G->C), up to 97% (T:G) ( Figure 2D), with an average of 87% coverage. Despite this almost complete coverage, the seven AA-long area positioned between residues 214 and 220 (ASSARQR) resists digestion, even though R residues targeted by trypsin are present ( Figure 2C). Perhaps BSA denaturation is incomplete under our experimental conditions and this domain is not exposed to the proteases. When we look at each type of digest individually, other areas resisting cleavage emerge, some of them common across different digests (e.g., position 162-171, LYEIARRHPY, shared between C, T->C, G->C, and T->G->C) or unique to a particular digest (e.g., position 268 to 275, CCHGDLLE, in G:C) ( Figure 2C). If we compare digests obtained using a unique enzyme, excellent BSA sequence coverage are observed: 91.3% for trypsin/LysC, 93.1% for GluC, and 90.2% for chymotrypsin ( Figure 2D). This also demonstrates comparable digestion efficiencies under our experimental conditions which abode to the protease manufacturer's guidelines. In a previous BUP study on milk, digestion of BSA using trypsin on its own achieved 65% AA sequence coverage [24]. Therefore, greater confidence in protein assignment is achieved by combining LysC to trypsin, as was reported [33][34][35][36]; this could also arise from a more optimum usage of the instrument duty cycle. In the protocol standardized by Giansanti and colleagues, BSA sequence coverage varied significantly depending on the protease used, with trypsin achieving 78.4%, GluC 61.9% and chymotrypsin 57.8% [53]. From our results, we conclude that, while trypsin is usually the gold standard for shotgun proteomics, these alternative enzymes should also be considered.
If we now turn our attention to digests obtained using multiple enzymes and compare sequential digestions (->) with pooled digests (:), we observe better alignment and coverage when individual digests are combined than when proteases are added. For instance, T->C digests covers 81% of the BSA sequence while T:C digest reach 91% coverage ( Figure 2D); the 10% difference represents 56 AAs. This is better exemplified when the three proteases are used together, with a 75% coverage in T->G->C samples and 94% coverage in T:G:C samples ( Figure 2D); the 19% difference represents 111 AAs. A similar multiprotease method was exploited by Liang and colleagues to achieve high sequence coverage of the toxic protein ricin and distinguish it from its close relative RCA120 agglutinin [57]. Irrespective of the proteases used, resorting to different enzymes to digests proteins from a complex biological matrix has proven extremely beneficial over the years as this strategy augments the AA sequence coverage and therefore strengthens protein inference as attested by the numerous publications on this topic [38,[44][45][46][47][48][50][51][52]54,55].

The Number of Miscleavages Is a Critical Parameter in Database Search
BSA peptides contain up to six miscleavages, with the majority (59%) presenting 1-3 miscleavages ( Figure 2F). The different digestion conditions peak at different miscleavages as can be seen in Supplementary Figure S4. For instance, the greatest number of tryptic and chymotryptic peptides exhibit one miscleavage while GluC-released peptides containing three miscleavages are the most numerous. The longest peptide (VSRSLGKVGTRCCTKPESERMPCTEDYLSLILNRLCVLHEKTPVSEKVTKCCTE, 6.4 kDa) released from the action of GluC contains eight charges and six miscleavages; it has a SpScore of 1572 and a Xcorr of 4.14 (Supplementary Table S1).
Traditionally in shotgun proteomics where trypsin is used to perform the enzymatic digestion of the protein extracts, the maximum number of missed cleavages is set to two. In our study, a significant proportion of BSA peptides (47%) contain more than two miscleavages (35% of BSA tryptic peptides) therefore we would have ignored almost half of the hits which such a restraint. Setting the number of allowed miscleavage to zero gives maximum discrimination between correct and incorrect matches but also assumes that the digestion was complete while increasing this number assumes digestion is partial [58]. Setting too high a number can be computationally taxing as it significantly augments the number of calculated peptide masses to be matched against the experimental data. In our experiment, we purposefully did not set a limit on the number of missed cleavages to assess whether longer peptides could be identified this way, as indeed is the case. In an MDP study on cell lysates, up to four missed cleavages were set for GluC in the search method [38]. Likewise, up to four miscleavages are recommended for chymotrypsin and GluC, and only two for trypsin [53].
For a given protein accession, the excess of limit-digested peptides (ELDP) is computed by subtracting the number of matched peptides with a missed cleavage site from the number of matched peptides without miscleavage [59]. ELDP is used to assess the completeness of BSA digestion. As the number of BSA peptides with one (248, Figure 2F) or more (941) miscleavages far exceeds the number of BSA peptides that bear no miscleavage (196), we can conclude that BSA digestion was partial and therefore the number of missed cleavage sites had to be increased.
Even though our experimental design cannot determine which missed cleavage parameters should have been used and that one might argue that too many missed cleavages are allowed, we have confidence in our results for the following reasons: (1) the decoy database search eliminates false positives, (2) only the peptides displaying a "high" minimum confidence in Proteome Discover (i.e., minimal FDR score for 2+ charge state = 0.01) are kept; indeed very long peptides displaying numerous miscleavages present very high score (Supplementary Table S1), (3) while up to ten miscleavages are permitted, a maximum of six are detected in the BSA dataset ( Figure 2F); if false positives were in the dataset, the whole range of miscleavages (i.e., up to 10) would have been detected. The reasons why the best studied protease, trypsin, regularly misses cleavage sites have been kinetically explained [29,32]; the propensity of GluC and chymotrypsin to miss seemingly cleavable sites should also be mechanistically elucidated. In the meantime, we recommend applying a number of missed cleavages greater than two during the database search stage.
From this experiment, we can conclude that BSA is highly amenable to enzymatic digestion by trypsin/LysC, GluC and chymotrypsin. Pooling the individual digests does not affect the nLC-MS/MS analysis as attested by the high sequencing coverage. Using multiple proteases consecutively yields relatively lower sequence coverage of BSA. This study also benchmarks the use of these digestion enzymes in our laboratory which can then be applied to complex biological samples such as plant protein extracts from mature buds of medicinal cannabis.

Medicinal Cannabis
Medicinal cannabis mature apical bud samples were extracted in triplicate from three plants of the same cultivar and aliquoted into equal protein content prior to performing the various enzymatic digestions in duplicate as schematized in Figure 1.

LC-MS Patterns from Medicinal Cannabis Digests Are Very Complex
As expected with protein-rich samples, the LC-MS patterns are very complex with cannabis peptides eluting from 9-39 min (9-39% ACN gradient) exhibiting m/z values spanning from 300 to 1700 (Supplementary Figure S5), which is comparable to what we previously observed [11]. The nLC-MS patterns corresponding to digests resulting from the action of GluC and chymotrypsin either sequentially (G->C) or mixed (G:C) present less peaks (Supplementary Figure S5). Digests resulting from the consecutive action of all three proteases (T->G->C) were consistently troublesome with the nanospray becoming sporadic from time to time, for no obvious reason. Furthermore, the pooled digests (denoted with a colon symbol) yielded unreproducible LC-MS patterns due to unstable nanospray (data not shown) therefore we did not include them to the rest of the analysis.
Statistical analyses were carried out on the volumes of the 27,123 peptides identified in this study. Multivariate analyses (PCA, PLS, HCA) were performed as well as a linear model which isolated 3349 peptides significantly responding to the digestion type (data not shown). The PCA projection plot of PC1 against PC2 using all identified peptides shows that samples are grouped by digestion type, with biological triplicate closely clustering together but technical duplicate separating out as they were run at two independent times ( Figure 3A), which can be resolved by randomizing the LC injection order.
PC1 explains 35% of the total variance and separates samples that include digestion with trypsin/LysC on the right-hand side away from the samples which do not on the left-hand side. PC2 explains 11.3% of the variance and discriminates samples on the basis of their treatment with (below zero) or without (above zero) chymotrypsin ( Figure 3A). Peptide mass is the determining factor behind the sample grouping across PC1xPC2 as can be seen on the PCA loading plot which illustrates that samples treated with GluC generate the longest peptides (>5 kDa, Figure 3B). A PLS analysis was performed using the 3349 peptides that were most significantly differentially expressed across the seven digestion types. This supervised statistical process helps better define groups according to a particular trait, in this instance the digestion type. The score plot of the first two components indeed achieves better separation of the different digestion types, with samples treated with GluC (G) away from all the other types ( Figure 3C). One group is composed of the samples treated with trypsin/LysC (T) on its own and combined to GluC (T->G). Another group comprises samples treated with chymotrypsin on its own (C) and with GluC (G->C). The last group positioned in between contains samples treated with trypsin/LysC and chymotrypsin (T->C), as well as with GluC (T->G->C). The main peptide characteristics behind such grouping is the m/z value as illustrated on the PLS loading plot ( Figure 3D). These types of multivariate analyses confirm the orthogonality of the proteases chosen in this experiment. study. Multivariate analyses (PCA, PLS, HCA) were performed as well as a linear model which isolated 3349 peptides significantly responding to the digestion type (data not shown). The PCA projection plot of PC1 against PC2 using all identified peptides shows that samples are grouped by digestion type, with biological triplicate closely clustering together but technical duplicate separating out as they were run at two independent times ( Figure 3A), which can be resolved by randomizing the LC injection order.   The MS data of cannabis digests is less reproducible across technical replicates than across biological replicates due to the fact that technical replicates were run independently. With an average of 9 ± 11%, CVs range from 0.3% (Bud 3 C) to 33% (Bud 3 T) and achieve less than 7% in 15 out of the 21 digestion tests ( Table 2). The number of MS/MS spectra obtained varied from 6,835 (Bud 2 T->G->C rep 1) to 12,827 (Bud 1 T rep 1), averaging 9072 ± 2047 which corresponds to 10% of the MS peaks.

Sequential Enzymatic Digestions of Medicinal Cannabis Samples Augment the Success Rate of MS/MS Annotations in SEQUEST
The MS data was searched against a C. sativa database using SEQUEST algorithm for protein identification purpose. Of all the MS/MS spectra generated from medicinal cannabis digests, between 824 (47% of the 1770 MS/MS spectra for Bud 2 T->G->C rep 2) and 4297 (38% of the 11,238 MS/MS spectra for Bud 3 T->C rep 1) are successfully annotated ( Table 2). On average, 29% of the MS/MS spectra yield positive database hits, which amounts to an average of 2.7% of MS peaks. The list of peptides from medicinal cannabis identified in this study is available in Supplementary Table S2.
The percentages of Table 2 are turned into a histogram in Supplementary Figure S6 to better visualize the proportions of useful data across proteases. As observed before for BSA samples, the proportion of MS peaks fragmented by MS/MS (light grey bars) remains fairly constant across the medicinal cannabis digests, ranging from 7-12% as it is set by the duty cycle. The proportion of MS/MS spectra annotated in SEQUEST (i.e., successful hits) however shows even more variation across proteases than reported on BSA above, fluctuating from 15% to 45% (black bars). Figure S6). In the case of medicinal cannabis protein extracts, resorting to a strategy involving sequential enzymatic digestions using two or three proteases proves very successful with high annotation rates: 28% for T->G, 34% for G->C, 37% for T->C and 45% for T->G->C (Supplementary Figure S6). However, even with such a high success rate, at best, 5% of MS peaks lead to positive identifications meaning that 95% of the data remains untapped into. One strategy to deepen the proteome analysis in a similar data-dependent acquisition manner would be to re-analyse each sample several times using iterative exclusion lists of the precursor ions already fragmented. Another strategy would be to employ a data-independent acquisition strategy such as the sequential window acquisition of all theoretical mass spectra (SWATH-MS) [60].

Higher percentages are reached when chymotrypsin (C) is employed on its own or in combination with trypsin/LysC (T->C) and/or GluC (G->C and T->G->C) (Supplementary
A total of 27,123 unique peptides from cannabis samples are identified in this study (Supplementary  Table S2). This far exceeds what we had previously achieved using a BUP based on trypsin digestion where 5675 unique peptides across all replicates were successfully matched [11]. If we consider some the characteristics of the medicinal cannabis peptides identified in this study, we can see that each protease behaves differently. For instance, the highest peptide ion scores are found among the peptides generated by trypsin/LysC, in particular when R residues are targeted, whereas the lowest scores belong to peptides resulting from the cleavage of D residues upon the action of GluC. Ion scores average around 6.1 ± 9.6 and reach up to 148 ( Figure 4A).
Out of the 27,123 cannabis peptides identified in this study, 80% (21,705) display one or multiple PTMs (Supplementary Table S2), including 4241 carbamidomethylation, 683 N-term acetylation, 18,716 phosphorylation and 9236 oxidation sites. Table 3 presents the distribution of PTMs per protease as imputed based on cleavage sites. Some annotated MS/MS spectra can be viewed in Figure S7. In these examples, peptides from ribulose bisphosphate carboxylase large chain (RBCL) are identified with high scores from GluC, chymotrypsin and trypsin/LysC ( Figure S7A). MS/MS annotation from SEQUEST in Supplementary Figure S7B illustrates how each enzyme helps extend the coverage of RBCL spanning the region Y29 to R79 (YQTKDTDILAAFRVTPQPGVPPEEAGAAVAAESSTGTWTTVWTDGLTSLDR) with chymotrypsin covering residues 41-66, GluC extending the coverage to the left down to residue 29 and Trypsin/LysC extending it to the right up to residue 79. MS/MS spectra display almost complete b-and y-series ions ( Figure S7B). RBCL is adorned with several dynamic PTMs, for instance oxidation of M116 ( Figure S7C) and phosphorylation of T173 and Y185 ( Figure S7D). Out of the 27,123 cannabis peptides identified in this study, 80% (21,705) display one or multiple PTMs (Supplementary Table S2), including 4241 carbamidomethylation, 683 N-term acetylation, 18,716 phosphorylation and 9236 oxidation sites. Table 3 presents the distribution of PTMs per protease as imputed based on cleavage sites. Some annotated MS/MS spectra can be viewed in Figure S7. In these examples, peptides from ribulose bisphosphate carboxylase large chain (RBCL) are identified with high scores from GluC, chymotrypsin and trypsin/LysC ( Figure S7A). MS/MS annotation from SEQUEST in Supplementary Figure S7B illustrates how each enzyme helps extend the coverage of RBCL spanning the region Y29 to R79 (YQTKDTDILAAFRVTPQPGVPPEEAGAAVAAESSTGTWTTVWTDGLTSLDR) with chymotrypsin covering residues 41-66, GluC extending the coverage to the left down to residue 29 and Trypsin/LysC extending it to the right up to residue 79. MS/MS spectra display almost complete b-and y-series ions ( Figure S7B). RBCL is adorned with several dynamic PTMs, for instance oxidation of M116 ( Figure S7C) and phosphorylation of T173 and Y185 ( Figure S7D).
The distribution of identified cannabis peptides according to the number of missed cleavages also reveals differences among proteases. Our method specified a maximum of ten missed cleavage sites, which is the highest number allowed in the Proteome Discoverer program and SEQUEST algorithm. Missed cleavages have been discussed in the BSA results. All things considered, only 5%  The distribution of identified cannabis peptides according to the number of missed cleavages also reveals differences among proteases. Our method specified a maximum of ten missed cleavage sites, which is the highest number allowed in the Proteome Discoverer program and SEQUEST algorithm. Missed cleavages have been discussed in the BSA results. All things considered, only 5% of the peptides present no missed cleavage and up to nine missed cleavages are detected in the MS/MS data ( Figure 4B and Supplementary Table S2). The greatest numbers of peptides resulting from trypsin/LysC or GluC present two missed cleavages while the largest number of chymotryptic peptides possess three missed cleavages. Therefore, setting the correct parameters is essential to maximise the number of successful hits. Had we limited our search method to two missed cleavages as is traditionally performed in shotgun proteomics, 65% of the tryptic cannabis peptides would not have been identified. Furthermore, allowing for more missed cleavage has the intrinsic benefit of yielding longer peptides, befitting the middle-down range, which is very advantageous for sequencing purposes. Indeed, in the present work, average masses of cannabis peptides steadily increase with the number of enzymatic cleaving sites missed, in a similar manner for each of the proteases ( Figure 4C).
When we observe the minimum masses, we can see that they increase with the number of missed cleavages, very similarly across all three proteases ( Figure 4D). The shortest cannabis peptide has a mass of 627.3956 Da (7 AAs, position 286-292, from Photosystem II protein D2), presents one miscleavage and arises from the action of chymotrypsin which is the least specific of the proteases tested in this experiment (Supplementary Table S2). When we observe the maximum masses, GluC systematically produce the largest peptides, fluctuating from 9479.692 to 10,027.014 Da, regardless of the number of missed cleavages ( Figure 4D). The longest peptide has a mass of 10,0027.014 Da (88 AAs, position 57 to 144, from CBDA synthase), bears six missed cleavage sites and arises from the action of GluC which is the most specific of the proteases tested in this work (Supplementary Table S2). Trypsin/LysC and chymotrypsin display similar patterns, namely the maximum masses increase as the number of missed cleavages go from 0 to 4, and then plateau around 9.6 kDa for subsequent numbers of missed cleavages ( Figure 4D).
In our previous BUP study on medicinal cannabis, we set the maximum number of missed cleavages to two [11]; we have since re-analysed these BUP results by setting the number of miscleavages to ten (the maximum) and found 43 additional peptides containing 3-9 missed cleavages thus confirming valuable information was ignored (data not shown). We exemplify this in Supplementary Figure S8 which summarises the sequencing results for OAC, a key enzyme in the phytocannabinoid biosynthetic pathway. The gain in the number of OAC peptides identified here relative to our previous study [11] equals 31 additional peptides (PTMs included), many of them being longer therefore covering larger portions of the AA sequence ( Figure S8A). Aligning the peptides along OAC sequence reveals that whilst complete coverage is achieved, more peptides from the N-terminus are identified relative to the C-terminus of the protein ( Figure S8B). Trypsin/LysC yields 85% coverage of OAC, GluC 57% and chymotrypsin 53% ( Figure S8C). In our previous BUP experiment only 34% coverage of OAC was reached using trypsin/LysC [11]; re-analysis with ten miscleavages brought OAC sequence coverage to 48% (data not shown). The complete 100% coverage of OAC observed here could only be achieved by combining the sequencing data associated with the four proteases together as none of them individually yield full coverage. This observation holds true for most proteins reported in this work.

Proteins from Medicinal Cannabis Are Identified with High Sequence Coverage
The 27,123 cannabis peptides identified here are assigned to 494 unique accessions (Table S3) which corresponds to 229 unique proteins (Table S4) from C. sativa and close relatives. This comprises 130 (57%) novel protein annotations when the list is compared to that of our previous BUP study [11]. The molecular weight (MW) of these cannabis proteins average 38 ± 34 kDa, ranging from 2.8 kDa (Photosystem II phosphoprotein) to 271.2 kDa (Protein Ycf2). The AA sequence coverage varies from 6% (NAD(P)H-quinone oxidoreductase subunit J, chloroplastic) to 100% (108 out of 229 identities, 47%). The vast majority of the proteins (187/229, 82%) display a sequence coverage greater than 80% (Supplementary Table S4) which is a great improvement to those obtained previously [11]. Therefore, using different proteases on their own or in combination allowed the identification of more proteins with greater confidence. This has repeatedly been demonstrated on various complex biological samples [38,[44][45][46][47][48][50][51][52]54,55]. Even though we did not prepare our samples with membrane proteins in mind (requiring specific membrane-disrupting chemicals such as sodium dodecyl sulphate), many of the identified accessions correspond to membrane proteins. They either display a transmembrane domain (141/494, 29%) and/or are localized to a biological membrane (100/494, 20%), mostly within chloroplasts (74%) (Supplementary Table S3). This Gene Ontology (GO) annotation is confirmed by elevated grand average of hydropathy (GRAVY) values (Supplementary Table S3). Chymotrypsin has been shown to help digest membrane proteins and therefore identify them by shotgun proteomics [61,62].
As previously observed [1], the 494 cannabis protein accessions are predominantly involved in cannabis secondary metabolism (23%), energy production (31%, including photosynthesis 18%), and gene expression (19%, in particular protein metabolism 14%) ( Figure 5). Using E.C. numbers, we performed a KEGG pathway mapping which highlights that all the phytocannabinoid-related enzymes and most of the enzymes involved in terpenoid backbone biosynthesis are identified in this study (Supplementary Figure S9). Ten percent of the proteins are of unknown function, including Cannabidiolic acid synthase-like 1 and 2 which display 84% similarity with CBDA synthase (data not shown). Most of the additional functions relative to [1] belong to the energy/photosynthesis pathway, translation mechanisms with many ribosomal proteins identified here (Supplementary Table S4), as well as a plethora (14.4%, 71 out of 494 accessions) of small auxin up regulated (SAUR) proteins (Supplementary Table S3). More significantly, all the enzymes involved in the cannabinoid biosynthetic pathway are identified and Using E.C. numbers, we performed a KEGG pathway mapping which highlights that all the phytocannabinoid-related enzymes and most of the enzymes involved in terpenoid backbone biosynthesis are identified in this study (Supplementary Figure S9). Ten percent of the proteins are of unknown function, including Cannabidiolic acid synthase-like 1 and 2 which display 84% similarity with CBDA synthase (data not shown). Most of the additional functions relative to [11] belong to the energy/photosynthesis pathway, translation mechanisms with many ribosomal proteins identified here (Supplementary Table S4), as well as a plethora (14.4%, 71 out of 494 accessions) of small auxin up regulated (SAUR) proteins (Supplementary Table S3). More significantly, all the enzymes involved in the cannabinoid biosynthetic pathway are identified and account for 14.4% of all the accessions ( Figure 5) when they made only 5.6% of the accessions in [11]. Additional proteins from this pathway are three truncated products from THCA synthase of 11, 33, and 49 kDa, as well as polyketide synthases 1 to 5 whose AA sequences show 95% similarity to that of OLS (data not shown). Newly identified proteins include enzymes from the isoprenoid biosynthetic pathway: 2-C-methyl-d-erythritol 4-phosphate cytidylyltransferase and 3-hydroxy-3-methylglutaryl coenzyme A synthase. A naringenin-chalcone synthase involved in the biosynthesis of phenylpropanoids is also newly identified here. Finally, novel elements of the terpenoid pathway include (+)-alpha-pinene synthase and 2-acylphloroglucinol 4-prenyltransferase found in the chloroplast (Supplementary Table S4). This study demonstrates that combining different proteases is needed to achieve deeper recovery and a more thorough analysis of the proteins not only involved in the secondary metabolism of C. sativa but in the diverse biological mechanisms occurring in the mature buds of this unique species. With an ultimate goal of analyzing multiple medicinal cannabis samples, the method presented here utilizes a relatively fast (60 min) nLC elution to ensure high throughput. We anticipate that deeper proteome coverage would be reached with longer nLC gradient or pre-fractionation steps.
C. sativa proteomics has come a long way since the first report in 2004 on reproductive organs and trichomes in which only 10 flower proteins and 54 gland proteins could be identified from Arabidopsis and rice accessions due to the absence of C. sativa database entries [19]. Ten years later in 2014, 481 proteins were identified from trichomes, with only 26 (5%) corresponding to C. sativa accessions [16]. The slow rate of entry creation from C. sativa species in public database was addressed in [11] where we highlighted that while the first C. sativa entry was created in 1986 in UniprotKB, by 2004 only six entries featured, and in 2014 entries amounted to 100. Most of C. sativa public entries (258) were uploaded in 2015-2017. There are currently (October 2019) 509 C. sativa protein accessions in UniprotKB. Early this year, we published a BUP study to optimize protein extraction from mature buds [11] in which 160 protein accessions were identified with 83% (133) of them matching a C. sativa accession. More recently, a 1-DE shotgun proteomics experiment showed that CBDA synthase and THCA synthases are secreted in trichome exudates and accumulated over the flowering period [20]. Using a TDP approach, we revealed previously unknown PTMs [12]. While progress has been made on cannabis proteomics, this area of research is still in its infancy and we hope that the pace will pick up in the future. Confirming the expression of C. sativa genes at the protein level will help validate the annotations of genome sequencing projects in a proteogenomic manner and facilitate breeding programs.

Materials and Methods
The experimental design is schematised in Figure 1.

Apical Bud Sampling and Grinding
Fresh plant material was obtained from the Victorian Government Medicinal Cannabis Cultivation Facility. Apical buds were excised using secateurs, snap frozen in liquid N 2 and stored at -80 • C until grinding. Samples were collected from three different plants. Frozen buds were ground with a mortar and pestle in liquid N 2 . The powder was transferred into a 15 mL tube and stored at -80 • C until further use.

Protein Extraction Methods
The protein extraction was described in [11] and up-scaled for this experiment as the same sample would undergo various protease digestions. Briefly, 0.5 g of ground frozen powder was transferred into a 15 mL tube kept on ice pre-filled with 12 mL ice-cold 10% TCA/10mM DTT/acetone (w/w/v). Tubes were vortexed for 1 min and left at −20 • C overnight. The next day, tubes were centrifuged for 10 min at 5000 rpm and 4 • C. The supernatant was discarded, and the pellet was resuspended in 10 mL of ice-cold 10mM DTT/acetone (w/v) by vortexing for 1 min. Tubes were left at −20 • C for 2 h. The tubes were centrifuged as specified before and the supernatant discarded. This washing step of the pellets was repeated once more. The pellets were dried for 60 min under a fume hood. The dry pellets were resuspended in 2 mL of guanidine-HCl buffer (6M guanidine-HCl, 10 mM DTT, 5.37 mM sodium citrate tribasic dihydrate, and 0.1 M Bis-Tris) by vortexing for 1 min, sonicating for 10 min and vortexing for another minute. Tubes were incubated at 60 • C for 60 min. The tubes were centrifuged as described above and 1.8 mL of the supernatant was transferred into 2 mL microtubes. A 40 µL volume of 1M iodoacetamide (IAA)/water (w/v) solution was added to the tubes to achieve a final IAA concentration of 20 mM and to alkylate the DTT-reduced proteins. The tubes were vortexed for 1 min and left to incubate at room temperature in the dark for 60 min.
A 10 mg/mL BSA solution was prepared in 1 mL of guanidine-HCl buffer. The tube was vortexed for 1 min and incubated at 60 • C for 60 min. A 20 µL volume of 1M IAA/water (w/v) solution was added to the tube to achieve a final IAA concentration of 20 mM. The BSA tube was vortexed for 1 min and left to incubate at room temperature in the dark for 60 min.

Protein Assay
Protein extracts were diluted ten times using the guanidine-HCl buffer prior to the assay. The protein concentrations were measured in triplicates using the Pierce Microplate BCA protein assay kit (Thermo Fisher Scientific Australia Pty Ltd, Scoresby, VIC, Australia) following the manufacturer's instructions. The BSA solution supplied in the kit (2 mg/mL) was used a standard.

Protein Digestion
An aliquot corresponding to 100 µg of BSA or plant proteins was used for digestion with proteases. Digestions were performed following the manufacturer's recommendations as detailed in the following subsections. We did not test different protease:protein ratios or different digestion buffers. We merely followed the manufacturer's guidelines.

Digestion Using a Trypsin/LysC Protease Mix (T)
The DTT-reduced and IAA-alkylated proteins were diluted six times using 50 mM Tris-HCl pH 8.0 to drop the guanidine-HCl resuspension buffer molarity to 1 M. Trypsin/LysC protease (Mass Spectrometry Grade, 100 µg, Promega, Alexandria, VIC, Australia) was carefully solubilised in 1 mL of 50 mM acetic acid and incubated at 37 • C for 15 min. A 40 µL aliquot of trypsin/LysC solution was added and gently mixed with the protein extracts thus achieving a 1:25 ratio of protease mix:proteins, as instructed by the manufacturer. The mixture was left to incubate overnight (18 h) at 37 • C in the dark.

Digestion Using GluC (G)
The DTT-reduced and IAA-alkylated proteins were diluted six times using 50 mM Ammonium bicarbonate (pH 7.8) to drop the guanidine-HCl resuspension buffer molarity to 1 M. Under these conditions, GluC protease (Mass Spectrometry Grade, 50 µg, Promega) presents a greater specificity towards E residues. GluC was carefully solubilised in 0.5 mL of ddH 2 O. A 10 µL aliquot of GluC solution was added and gently mixed with the protein extracts thus achieving a 1:100 ratio of protease:proteins. The mixture was left to incubate overnight (18 h) at 37 • C in the dark.

Digestion Using Chymotrypsin (C)
The DTT-reduced and IAA-alkylated proteins were diluted six times using 100 mM Tris/10mM CaCl 2 pH 8.0 to drop the guanidine-HCl resuspension buffer molarity to 1 M. Chymotrypsin protease (Sequencing Grade, 25 µg, Promega) was carefully solubilised in 0.25 mL of 1 M HCl. A 10 µL aliquot of chymotrypsin solution was added and gently mixed with the protein extracts, thus achieving a 1:100 ratio of protease:proteins. The mixture was left to incubate overnight (18 h) at 25 • C in the dark.

Sequential Digestions Using Several Proteases (G->C, T->G, T->C, T->G->C)
Digestion using GluC was performed as described above. A 10 µL aliquot of Chymotrypsin solution was then added, gently mixed with the GluC digest, and incubated at 25 • C in the dark for 18h. This yielded the digest we refer to as G->C. Digestion using Trypsin/LysC was performed as described above. A 10 µL aliquot of GluC or Chymotrypsin solution was then added and gently mixed with the trypsin/LysC digest. The tubes were incubated in the dark for 18 h at 37 • C for GluC or 25 • C for Chymotrypsin. These steps yielded the digests we refer to as T->G or T->C. For the sequential digestion combining all proteases, to the T->G sample was added a 10 µL aliquot of Chymotrypsin solution and incubated at 25 • C in the dark for 18h. This yielded the digest we refer to as T->G->C. In an effort to assess the efficiency of the sequential digestions (T->G, T->G, G->C, T->G->C), individual BSA digests resulting from the independent activity of Trypsin/LysC, GluC and Chymotrypsin were pooled together using the same volumes. Thus, the Trypsin/LysC digest was pooled with the GluC digest (T:G), the Trypsin/LysC digest was pooled with the Chymotrypsin digest (T:C), the GluC digest was pooled with the Chymotrypsin digest (G:C), and the three Trypsin/Lys, GluC and Chymotrypsin digests were also pooled together (T:G:C).

Desalting
All of the digestion reactions were stopped by lowering the pH of the mixture using a 10% formic acid (FA) in H 2 O (v/v) to a final concentration of 1% FA.
The digest was transferred into a 100 µL glass insert placed into a glass vial. The vials were positioned into the autosampler at 4 • C for immediate analyses by nLC-MS/MS.

Peptide Digest Analysis by Nano Liquid Chromatography-Tandem Mass Spectrometry (nLC-MS/MS)
The nLC-ESI-MS/MS analyses were performed on all the peptide digests in duplicate. Chromatographic separation of the peptides was performed by reverse phase (RP) using an Ultimate 3000 RSLCnano System (Dionex, Thermo Fisher Scientific Australia Pty Ltd) online with an Elite Orbitrap hybrid ion trap-Orbitrap mass spectrometer (Thermo Fisher Scientific Australia Pty Ltd). The parameters for nLC and MS/MS have been described in [24] and [11]. Briefly, 0.1 µg peptides were separated by nLC with a 3% to 40% B gradient over 35 min. Full MS scans were acquired over a mass range of 300 to 2000 m/z with a 60,000 resolution in profile mode. The 20 most intense peaks with charge state ≥ 2 and a minimum signal threshold of 10,000 were MS/MS fragmented in the linear ion trap using collision-induced dissociation (CID). Dynamic exclusion was enabled, and peaks selected for fragmentation more than once within 10 s were excluded from selection for 30 s. Each digest was injected twice, with first injecting all the digests (technical replicate 1) and then fully repeating the injections in the same order (technical replicate 2).

Database Search for Protein Identification and Annotation
This proof-of-concept work aims at demonstrating the gain in sequence coverage, hence protein identities, yielded upon the use of multiple proteases relative to our previous study [11]. To this end, we employ the same database search strategy. However, more sequences can be retrieved from genome sequencing projects [8][9][10], the National Center for Biotechnology Information (NCBI) website, and the Medicinal Plant Genomic Resource (MPGR) website. This will be achieved in our future experiments. Database searching of the RAW files was performed in Proteome Discoverer (PD) 1.4 using SEQUEST algorithm as described in [11]. All 668 Cannabis sativa protein sequences publicly available in October 2019 from UniprotKB (www.uniprot.org; key word used "Cannabis sativa", https://www.uniprot.org/uniprot/?query=taxonomy:3744%20cannabis%20sativa) were downloaded as a FASTA file. These also included 87 sequences from the European hop Humulus lupulus, the closest relative to C. sativa [63], as well as 72 sequences from the Chinese grass Boehmeria nivea also closely related to cannabis [63]. Because the GOT sequence was not included, we retrieved it from patent WO 2011/017798 Al [64] and included it in the FASTA file (669 entries, available as Supplementary Data). The FASTA file was imported and indexed in PD 1.4. The SEQUEST algorithm was used to search the indexed FASTA file. The database searching parameters specified trypsin, or GluC, or chymotrypsin or their respective combinations as the digestion enzymes and allowed for up to ten missed cleavages. The precursor mass tolerance was set at 10 ppm, and fragment mass tolerance set at 0.8 Da. Peptide absolute Xcorr threshold was set at 0.4, the fragment ion cutoff was set at 0.1%, and protein relevance threshold was set at 1.5. Carbamidomethylation (C) was set as a static modification and oxidation (M), phosphorylation (STY), and N-Terminus acetylation were set as dynamic modifications. The target decoy peptide-spectrum match (PSM) validator was used to estimate false discovery rates (FDR). At the peptide level, peptide confidence value set at high was used to filter the peptide identification, and the corresponding FDR on peptide level was less than 1%. At the protein level, protein grouping was enabled.
All nLC-MS/MS files are available from the stable public repository MassIVE at the following URL: http://massive.ucsd.edu/ProteoSAFe/datasets.jsp with the accession number MSV000084216. Using the FASTA sequences of the identified accessions, GRAVY values were retrieved online from https://www. bioinformatics.org/sms2/protein_gravy.html. Likewise, GO subcellular localizations were retrieved online from the UniprotKB Retrieve/ID mapping webpage (https://www.uniprot.org/uploadlists/). Enzyme E.C. numbers were retrieved from UniprotKB and BRENDA websites (https://www.brendaenzymes.org/). Pathway analysis was performed by uploading enzyme E.C. numbers into the online pathway mapping tool of KEGG website (https://www.genome.jp/kegg/tool/color_a_pathway.html).

nLC-MS/MS Data Processing
The data files obtained following nLC-MS/MS analysis were processed in the Refiner MS module of Genedata Expressionist ® 12.0 with the following parameters: (1) Load from file by restricting the range from 8-45 min, (2) metadata import, (3) Spectrum smoothing using moving average algorithm and a minimum of 5 points, (4) RT structure removal using a minimum of 3 scans, (5) m/z grid using an adaptative grid method with a scan count of 10 and a 10% smoothing, (6) chromatogram RT alignment with a pairwise alignment based tree, a maximum shift of 50 scans and no gap penalty, (7) chromatogram peak detection using a 10 scan summation window, a 0.1 min minimum peak size, 0.04 Da maximum merge distance, a boundaries merge strategy, a 20% gap/peak ratio, a curvature-based algorithm, intensity-weighed and using inflection points to determine boundaries, (8) MS/MS consolidation, (9) Proteome Discoverer import accepting only top-ranked database matches and no decoy results, (10) peak annotation, (11) export analyst using peak volumes and implicit logarithm (treating zeros as missing values).
A peptide mapping activity for BSA digest samples was also performed using the mature AA sequence of the protein (P02769|25-607) following step 8 (MS/MS consolidation) as follows: (12) Selection of the relevant protease digests, (13) peptide mapping using the following parameters: 10 ppm mass tolerance, ESI-CID/HCD instrument, 0.8 Da fragment tolerance, min fragment score of 30, top-ranked only, discard mass-only matches, enzymes varied according to the protease(s) used, 6 max missed cleavages, min peptide length of 3, fixed carbamidomethyl (C) modification, and variable oxidation (M) modification.

Statistical Analyses
Statistical analyses were performed using the Analyst module of Genedata Expressionist ® 12.0 where columns denote plant samples and rows denote digest peptides. Peak volumes exported from the Refiner module were used as a proxy of peptide quantities for all statistical analyses. Principal Component Analyses (PCA) were performed on rows using a covariance matrix with 40% valid log-transformed values and row mean as imputation. A linear model performed on rows and testing the digestion type. Partial Least Square (PLS) analyses were run on the most significant rows resulting from the linear model. PLS response was the digestion type with three latent factors, 50% valid values and row mean as imputation. Hierarchical clustering analysis (HCA) was performed on columns using positive correlation and Ward linkage method. Histograms were generated by exporting the number of peaks, number of MS/MS spectra, and masses of the identified peptides to a Microsoft excel 2016 (office 365) spreadsheet.