Predicting and validating protein degradation in proteomes using deep learning

Age, disease, and exposure to environmental factors can induce tissue remodelling and alterations in protein structure and abundance. In the case of human skin, ultraviolet radiation (UVR)-induced photo-ageing has a profound effect on dermal extracellular matrix (ECM) proteins. We have previously shown that ECM proteins rich in UV-chromophore amino acids are differentially susceptible to UVR. However, this UVR-mediated mechanism alone does not explain the loss of UV-chromophore-poor assemblies such as collagen. Here, we aim to develop novel bioinformatics tools to predict the relative susceptibility of human skin proteins to not only UVR and photodynamically produced ROS but also to endogenous proteases. We test the validity of these protease cleavage site predictions against experimental datasets (both previously published and our own, derived by exposure of either purified ECM proteins or a complex cell-derived proteome, to matrix metalloproteinase [MMP]-9). Our deep Bidirectional Recurrent Neural Network (BRNN) models for cleavage site prediction in nine MMPs, four cathepsins, elastase-2, and granzyme-B perform better than existing models when validated against both simple and complex protein mixtures. We have combined our new BRNN protease cleavage prediction models with predictions of relative UVR/ROS susceptibility (based on amino acid composition) into the Manchester Proteome Susceptibility Calculator (MPSC) webapp http://www.manchesterproteome.manchester.ac.uk/#/MPSC (or http://130.88.96.141/#/MPSC). Application of the MPSC to the dermal proteome suggests that fibrillar collagens and elastic fibres will be preferentially degraded by proteases alone and by UVR/ROS and protease in combination, respectively. 
We also identify novel targets of oxidative damage and protease activity including dermatopontin (DPT), fibulins (EFEMP-1,-2, FBLN-1,-2,-5), defensins (DEFB1, DEFA3, DEFA1B, DEFB4B), proteases and protease inhibitors themselves (CTSA, CTSB, CTSZ, CTSD, TIMPs-1,-2,-3, SPINK6, CST6, PI3, SERPINF1, SERPINA-1,-3,-12). The MPSC webapp has the potential to identify novel protein biomarkers of tissue damage and to aid the characterisation of protease degradomics leading to improved identification of novel therapeutic targets.


Introduction
Although the causative mechanisms of ageing are not yet fully understood, there is compelling evidence that biochemical pathways, including protein oxidation, proteolysis and protease-mediated cleavage, contribute to loss of proteostasis (protein homeostasis) and the age-related decline of organs (1). Similarly, oxidative stress and aberrant protease activity are also implicated in pathological remodelling of both acute and chronically inflamed tissues and organs (2)(3)(4).
In skin, ultraviolet radiation (UVR), oxidative stress and upregulated protease activity are interlinked processes associated with clinical photoageing which manifests in the dermis as profound histological remodelling of fibrillar collagen and elastic fibres and the accumulation of protein carbonyls, oxidative damage (5)(6)(7)(8). Collectively this UV, ROS and protease driven proteolysis in ageing and diseased tissue can be termed the degradome (9). A better understanding of the degradomic degeneration of tissue proteomes may lead to identification of novel biomarkers of damage and bioactive matrikines which can be used to design novel therapeutics (10,11).
Previous studies from our group suggest that both the in vivo remodelling of the elastic fibre-associated fibrillin microfibrils, which is a hallmark of early photoageing, and the relative molecular and ultrastructural susceptibility of these assemblies in vitro to UVR are likely to be due to the specific amino acid (AA) compositions of major components such as fibrillin-1 (7,12,13). These proteins, and others also enriched in both UVR-absorbing (UV-chromophore) and oxidation-sensitive AA residues (e.g. fibronectin and lens crystallins), are susceptible to degradation by environmentally attainable UVR doses (14). In contrast, these doses and wavelengths do not affect the electrophoretic mobility of collagen I or tropoelastin, which are largely devoid of these AA residues, or the ultrastructure of collagen VI which contains fewer UV-chromophores than either fibronectin or fibrillin-1 (13). Recently we have also demonstrated, using a newly-developed proteomic peptide location fingerprinting methodology, that UVR exposure can induce subtle structure-associated changes in the largest collagen VI alpha chain (alpha-3) which are not detectable as changes in global collagen VI microfibril ultrastructure (15) or architecture (16). Therefore, AA composition appears to be a good predictor of relative susceptibility to UVR and oxidative damage (which is also a factor in non-UVR exposed tissues) (17)(18)(19)(20). However, predicting the relative susceptibility of proteins to protease-mediated cleavage is a more difficult task, as enzymatic proteolysis depends not only on the primary structure (i.e. AA sequence) but also on protein folding and interactions between enzymes and the exposed protein surface, and hence on secondary, tertiary or quaternary structure (21).
The matrix metalloproteinases (MMPs) are a large family of zinc-dependent enzymes which are thought to play key roles in skin photoageing and many other age- and disease-related disorders (22).
Other enzyme families such as serine proteases can also degrade ECM proteins (23,24) but their action in skin ageing is not as well characterised. For enzymes such as trypsin (where cleavage occurs at the C-terminal side of Arg and Lys AA residues when not followed by Pro) and the Staphylococcus aureus protease V8 (GluC) (where cleavage occurs on the C-terminal side of Glu in preference to Asp), the validated prediction of cleavage sites is well established (25,26). However, the prediction of cleavage sites in ECM proteins by proteases such as the MMPs and cathepsins requires the development and application of complex mathematical models based on state-of-the-art machine learning and deep learning techniques which utilise data available in databases such as MEROPS (27,28). A number of bioinformatic tools have been developed to predict cleavage sites in AA sequences. For example, PROSPER, which is based on Support Vector Machines (SVMs), utilises some structural features to perform the prediction, and DeepCleave, developed more recently, uses a state-of-the-art deep convolutional neural network (CNN) (29)(30)(31)(32). However, to our knowledge the predictions of these algorithms have not been experimentally validated against native ECM proteins, nor has their comparative performance been evaluated in this context. Moreover, although recurrent neural networks have shown promising results in protein sequence and function prediction (33), to our knowledge this architecture has not previously been evaluated for proteolytic cleavage site prediction (34,35).
The first goal of this study was to use the AA sequence information and relevant sequence-derived features to: i) predict and stratify the relative susceptibility of dermal ECM proteins (as defined by the Manchester Skin Proteome (36)) to UVR/oxidative damage; ii) relate these predictions to published experimental data; and iii) identify new potential targets of photodamage and photo-oxidation. The second goal was to test the ability of PROSPER and DeepCleave to identify MMP9-determined cleavage sites in two exemplar purified proteins (decorin [DCN] and vitronectin [VTN]) and subsequently in a complex ECM proteome derived from cultured human dermal fibroblasts (HDFs).
Given the relatively poor performance of these algorithms against native ECM proteins, our third aim was to develop and evaluate a new protease cleavage prediction algorithm: Manchester Proteome Cleave (MPC), which uses a state-of-the-art deep bidirectional recurrent neural network (deep BRNN) architecture (34,35). Our final aim was to integrate both UVR/ROS and protease prediction methods into a webtool, termed the 'Manchester Proteome Susceptibility Calculator (MPSC)', which can predict the relative susceptibility of proteins to degradative mechanisms and hence identify potential novel biomarkers of age- and disease-induced tissue damage.

Results
Predicted UVR/ROS susceptibilities correlate well with published experimental data and reveal novel potential markers of photodegradation.
Our first aim was to survey the amino acid composition of all skin proteins to predict relative UVR/ROS susceptibility. Ultraviolet radiation can be subdivided into three distinct wavelength ranges: UVC (100-280 nm), UVB (280-315 nm), and UVA (315-400 nm). UVC is absorbed by the ozone layer, and so solar UVR at the Earth's surface consists of 5% UVB and 95% UVA. These wavelengths can penetrate skin, generating reactive oxygen species (ROS) (7,37). The AA residues Trp, Tyr and double-bonded Cys (Cys=Cys) are sensitive to both biologically relevant wavelengths of UVR and oxidation (14), whereas Met, His, and Cys are sensitive to oxidation alone (38). We have previously established that proteins rich in these AA residues are at risk of degradation by UVR and/or ROS (7). In this study, building upon our initial insights, we have established a mathematical model that predicts relative protein susceptibilities to UVR/ROS within a whole skin proteome. To validate the UV/ROS model, we reviewed published data from experimental studies which subjected ECM proteins and assemblies to physiologically relevant UVR doses and wavelengths and categorised selected proteins as susceptible, semi-susceptible or resistant accordingly (Supporting File 1). We show that the mathematical model correctly predicts the relative experimental susceptibilities of these proteins (Fig.1.a). We have integrated this mathematical model in our webtool, the MPSC, which allows analysis of protein sequences for their UV, ROS and UV/ROS susceptibilities.
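The composition-based idea described above can be illustrated with a minimal sketch. It is not the MPSC model itself, whose exact formula is not given in this section. It simply counts the residue classes named in the text, and it makes two labelled assumptions: every Cys is treated as a potential disulphide-bonded cystine (bonding state cannot be read from sequence alone), and the UV-chromophores are also counted as oxidation-sensitive. The function name `composition_percentages` is hypothetical.

```python
from collections import Counter
from typing import Tuple

# Residue classes named in the text (one-letter codes).
# Assumption: every Cys (C) is counted as a potential disulphide-bonded cystine.
UV_CHROMOPHORES = set("WYC")        # Trp, Tyr, Cys=Cys
# Assumption: UV-chromophores are also counted as oxidation-sensitive,
# together with Met and His.
OXIDATION_SENSITIVE = set("WYCMH")


def composition_percentages(sequence: str) -> Tuple[float, float]:
    """Return (% UV-chromophore residues, % oxidation-sensitive residues)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    uv = sum(counts[aa] for aa in UV_CHROMOPHORES)
    ox = sum(counts[aa] for aa in OXIDATION_SENSITIVE)
    return 100.0 * uv / len(seq), 100.0 * ox / len(seq)


if __name__ == "__main__":
    uv_pct, ox_pct = composition_percentages("WYCMHAAAAA")
    print(f"UV-chromophores: {uv_pct:.1f}%  oxidation-sensitive: {ox_pct:.1f}%")
```

Percentages computed this way for a set of proteins can be plotted against each other to rank the proteins relative to one another, which is the kind of stratification the text describes.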
Applying the MPSC to skin ECM proteins (as defined by the Manchester Skin Proteome), we predict that many elastic fibre-associated proteins which play roles in fibre formation, structure and organisation (FBN1, FBN2, FBLN1, FBLN5, LTBP2, LTBP3, LTBP4, EFEMP2, MFAP2 and MFAP4) (39,40) will be highly susceptible to degradation by both UVR and oxidation (Fig.1.b). These predictions agree well with in vivo observations of dermal remodelling in photoageing. For example, a key hallmark of severely photoaged skin is the presence of disorganised material containing multiple elastic fibre proteins, termed solar elastosis (41), whilst mildly photoaged skin is characterised by the loss of FBN1 and FBLN5 microfibrils from the upper dermis, which is exposed to the highest UVR dose (42,43). We have demonstrated in vitro that chromophore-rich fibrillin-1 (in the form of fibrillin microfibrils) is susceptible to low dose UVR whilst chromophore-poor tropoelastin (ELN; the elastin precursor) is resistant (12,13). Our analysis predicts that skin collagens would be relatively resistant to both UVR and to ROS-mediated oxidation (Fig.1.a and b). Previously we have shown that the electrophoretic mobility of collagen I, the most abundant skin protein, is unaffected by physiologically relevant doses of UVR (12). In contrast, it is clear from observational studies that collagen I abundance decreases in aged (and particularly in photo-aged) human skin, which leads to dermal thinning (44,45). In addition to predicting the relative UVR/ROS susceptibilities of major structural components such as elastin and the collagens, our analysis also identifies potential novel markers involved in the regulation of TGFβ and collagen fibril formation (DPT) and ECM development (MGP) which, to our knowledge, have not previously been implicated in photoageing (Fig.1.b).
Whilst UVR/oxidative damage is clearly an important mediator of protein degradation in skin, protease-induced proteolysis is also thought to play a key role. Degradative mechanisms do not act in isolation (50,51) and we, and others, have shown that these two mechanisms may interact, with UVR exposure enhancing subsequent protease action (7,52). Our prediction that dermal collagens I, III, IV, V and VII are likely to be highly resistant to UVR and ROS implies that their degradation in vivo must be mediated by other mechanisms such as extracellular proteases.

Figure 1. a) Applying thresholds to distinguish between experimentally susceptible and resistant proteins suggests that a composition of > 5% UVR AA residues and > 10% oxidation-sensitive AA residues may be indicative of UV/ROS susceptibility (susceptible = grey box, resistant = green box). b) Predicted UVR and ROS susceptibilities of skin's ECM proteins. Elastic fibre-associated proteins, except for elastin itself, and collagens (green) clearly stratify into susceptible and resistant risk categories, respectively. c) Predicted UVR and ROS susceptibility of non-ECM extracellular proteins. Defensins (red) are highlighted as an example of a susceptible protein family.

Accurate prediction of protease cleavage sites in native ECM proteins is challenging.
We next aimed to predict the location and number of protease cleavage sites within skin-expressed proteins. As the relative abundance of AA residues can predict protein susceptibility to UVR/ROS, we hypothesise that protease cleavage site load can determine the relative susceptibility of proteins to enzymatic degradation. In skin, MMPs and other proteases such as elastase and members of the cathepsin and granzyme families play a significant role in tissue maintenance and remodelling (50,51).
Prediction of proteolytic cleavage sites is dependent on not only the primary (1°) AA sequence but also the higher-order 2°-4° structures which determine features such as solvent accessibility and disordered regions (53,54). Although currently available algorithms such as PROSPER and DeepCleave can predict cleavage sites with a good degree of accuracy (29,30), these predictions: i) are often limited to a specific subset of proteases; and ii) lack experimental validation in both simple and complex model systems. In this study, we used SDS-PAGE and mass spectrometry to characterise protein degradation and to validate and compare protease cleavage site predictions (by PROSPER and DeepCleave) for two purified native ECM proteins.
Initially, we used a simple model system, digesting purified DCN and VTN with recombinant MMP9 (chosen as an exemplar enzyme due to its importance in the skin photoageing process (55)). By gel electrophoresis we confirmed that both DCN and VTN are MMP9 substrates (56,57). In quadruplicate experiments, following MMP9 exposure, the band corresponding to VTN was absent and there was evidence of substantial aggregation. In contrast, the DCN band remained detectable following MMP9 exposure although there was some evidence of degradation (Fig.2.a) (Supporting File 2). Cleavage sites in these two proteins were characterised experimentally by subsequent in-gel digestion of the same gels with trypsin and liquid chromatography tandem mass spectrometry (LC-MS/MS), followed by a bioinformatic analysis pipeline which searched for non-tryptic (assumed to be MMP9-derived) sites in the digested samples. Using LC-MS/MS, we identified 33 putative cleavage sites in VTN compared with 18 in DCN (Fig.2.b). LC-MS/MS revealed that cleavage sites in VTN were predominantly densely packed between AAs 300-400, corresponding to the heparin-binding domain which has proven involvement in fibronectin deposition (58). In contrast, DCN cleavage sites were distributed throughout the AA sequence (Fig.2.b). We next compared these experimentally determined cleavage sites with those predicted by PROSPER and DeepCleave. While PROSPER and DeepCleave achieve excellent performance against data available in MEROPS, their accuracy in predicting MMP9 cleavage sites in native VTN and DCN (as evaluated by AUC score) was only slightly better than random (PROSPER AUC: 0.55; DeepCleave AUC: 0.64) (Fig.2.c).
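The search for non-tryptic peptide termini described above can be sketched as follows. This is an illustrative simplification, not the pipeline used in the study: it assumes peptides have already been identified and mapped by exact string match, it applies the trypsin rule stated in the Introduction (cleavage C-terminal to Arg/Lys unless followed by Pro), and it ignores protein termini. The function names are hypothetical.

```python
from typing import List, Set


def is_tryptic_bond(protein: str, p1_index: int) -> bool:
    """True if trypsin could cut after protein[p1_index] (0-based):
    Arg or Lys at P1 and no Pro at P1'."""
    p1 = protein[p1_index]
    p1_prime = protein[p1_index + 1] if p1_index + 1 < len(protein) else ""
    return p1 in "KR" and p1_prime != "P"


def non_tryptic_sites(protein: str, peptides: List[str]) -> Set[int]:
    """Return 1-based positions of P1 residues for peptide termini that are
    explained neither by trypsin specificity nor by the protein termini."""
    sites: Set[int] = set()
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:  # consider every occurrence of the peptide
            end = start + len(pep)  # exclusive end index
            # N-terminal end: the bond between residues start-1 and start
            if start > 0 and not is_tryptic_bond(protein, start - 1):
                sites.add(start)  # 1-based position of the P1 residue
            # C-terminal end: the bond between residues end-1 and end
            if end < len(protein) and not is_tryptic_bond(protein, end - 1):
                sites.add(end)  # 1-based position of the P1 residue
            start = protein.find(pep, start + 1)
    return sites


if __name__ == "__main__":
    toy_protein = "MKTAYIAKQRQISFVK"  # made-up sequence for illustration
    print(sorted(non_tryptic_sites(toy_protein, ["TAYIAK", "YIAK", "QISF"])))
```

In the toy example, the fully tryptic peptide contributes no sites, whereas the two semi-tryptic peptides each contribute one candidate cleavage position.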
In a parallel set of experiments, we generated a complex ECM proteome by decellularizing a post- These performance results highlight the need for better proteolysis models which utilise not only state-of-the-art machine learning approaches but also more expansive, up-to-date training datasets of native protein substrates, calibrated against experimental data.

Protease cleavage site prediction performance can be improved using a deep bidirectional recurrent neural network architecture
Given the relatively poor performance of existing algorithms in predicting MMP9 cleavage in native ECM proteins we next aimed to develop and evaluate a new protease cleavage prediction algorithm.
To better understand the complex proteolysis in a tissue such as ageing/inflamed skin, it is important that computational models are capable of accurately predicting cleavage sites for a wide range of proteases for which there is limited experimental evidence. Recurrent neural network (RNN) architectures, initially developed for natural language processing, are particularly suited to modelling genomic and proteomic sequences (59). This is important for modelling proteolysis, as the AAs surrounding the cleavage sites play a significant role in determining the cleavage specificity (60). RNNs have recently achieved notable success in the field of proteomics, but have not yet been used to model proteolysis (35,61). Here we adapted the methodologies employed by both PROSPER and DeepCleave to develop a novel, deep bidirectional recurrent neural network (deep BRNN) based proteolysis prediction algorithm calibrated against in-house experimental datasets.
For data collection, the training, testing and validating datasets for serine proteases and MMPs were collected from the MEROPS database (27). In addition to the primary AA sequences, secondary structure, disordered regions and solvent accessibility may also play significant roles in determining the probability of protease cleavage (30).
We therefore used the methods previously employed by PROSPER (PSIPRED (63), DISOPRED2 (64) and ACCPRO 5.2 (65,66)) to predict these structural features for the whole human proteome prior to sequence encoding and model training. Each AA for every protein, in combination with these structural features, was encoded in a format suitable for RNNs. For sequence encoding, a sliding 8 AA window (4 AAs upstream and 4 AAs downstream of the predicted cleavage site) was utilised (Fig.3.b).
To build a cleavage site prediction model we needed to address two challenges: i) there is limited data reporting MMP9 substrate specificity (e.g. MEROPS contains only 53 protein substrates corresponding to 301 cleavage sites) which can be used to develop deep learning models (Supporting File 3); and ii) there is a large imbalance between cleavage and non-cleavage sites (i.e. the number of non-cleavage sites vastly outweighs the number of cleavage sites), resulting in unbalanced datasets. The first challenge was addressed by: 1) complementing the MEROPS cleavage data with data available in the Eckhard 2016 study (62); and 2) using transfer learning approaches, where a general protease cleavage model was pretrained using all the available data (separately for MMPs and other proteases) and subsequently used as a starting point to train protease-specific models (using only the cleavage data for a specific protease) (67). The second challenge was addressed by weighting the classes prior to model training so that the model would "pay more attention" to the minority (cleavage site) class. For model training, different architectures, depths and hyperparameters of deep learning networks were evaluated in terms of AUC, F1 and MCC scores against cleavage sites sourced from three test sets: i) protein substrate identities for each protease from MEROPS and the Eckhard 2016 study which were excluded from the training datasets (consisting of 15% of the total number, randomly selected); ii) the DCN and VTN; and iii) the HDF MMP9 cleavage datasets as previously described. The best performing architecture was composed of four bi-LSTM layers, one dense layer and one fully connected layer (Fig.3.c) Fig.3.d), respectively. Similarly, MPC performed better for the 15% protein test set than PROSPER and DeepCleave for all other modelled proteases.
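The class weighting used to address the second challenge can be illustrated with a short sketch. The text does not state which weighting scheme was used, so the inverse-frequency ("balanced") heuristic below is an assumption, and the function name `balanced_class_weights` is hypothetical.

```python
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * n_samples_in_class),
    so that the rarer class receives the larger weight."""
    n_samples = len(labels)
    classes = sorted(set(labels))
    weights = {}
    for cls in classes:
        n_in_class = sum(1 for y in labels if y == cls)
        weights[cls] = n_samples / (len(classes) * n_in_class)
    return weights


if __name__ == "__main__":
    # Toy labels: 95 non-cleavage sites (0) and 5 cleavage sites (1).
    toy_labels = [0] * 95 + [1] * 5
    print(balanced_class_weights(toy_labels))
```

A dictionary of this shape (class index mapped to weight) is the form accepted by the `class_weight` argument of Keras `Model.fit`, which is one common way to apply such weights during training.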
For each of the proteases there was some agreement between MPC, DeepCleave and PROSPER predicted cleavage sites; however, each algorithm also identified cleavage sites unique to that model (Supporting File 3). Having developed both a model capable of predicting oxidative and/or UVR susceptibility and a new algorithm capable of predicting individual protease cleavage sites, we next aimed to combine these into a webtool suited to predicting all these aspects of protein susceptibility to damage. In a similar approach to the UV/ROS calculations, we calculated the number of protease cleavage sites per length of the protein, which we correlated to protein susceptibility. To create a more encompassing statistical output with a better generalisation capability, the final output of MPSC consists of an ensemble model averaging five MPC model outputs (Fig.3.d). The HDF-derived matrix experiment not only provided information on MMP9 substrates but also revealed 20 proteins which did not appear to have LC-MS/MS-detectable MMP9 cleavage sites. These 20 proteins included COL3A1, COL12A1, LTBP2, VCAN and others (Fig.2.d). We used the identities of these cleaved and non-cleaved proteins to determine whether predicted MMP9 protease susceptibilities differed statistically between these two groups. The MPSC score threshold for MMP9 was set to 0.8, which corresponds to highly confident cleavage sites. Using this approach, proteins with no experimentally detected MMP9 cleavage in the HDF-derived proteome also had a significantly lower predicted MMP9 susceptibility than the proteins that had at least one MMP9 cleavage site (p = 0.005, Student's t-test). This analysis suggests that novel substrates for proteolytic degradation may be identified using in silico proteolytic modelling.
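The two calculations described above, ensemble averaging of model outputs and cleavage sites per protein length at a confidence threshold, can be sketched as follows. This is a minimal illustration with made-up scores. Treating a score equal to the threshold as confident is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence


def ensemble_average(model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-position cleavage scores across several models."""
    if not model_scores:
        raise ValueError("no model outputs supplied")
    length = len(model_scores[0])
    if any(len(scores) != length for scores in model_scores):
        raise ValueError("all models must score the same positions")
    n_models = len(model_scores)
    return [sum(scores[i] for scores in model_scores) / n_models
            for i in range(length)]


def cleavage_site_density(scores: Sequence[float], protein_length: int,
                          threshold: float = 0.8) -> float:
    """Number of confident predicted cleavage sites per residue.
    Assumption: a score equal to the threshold counts as confident."""
    confident = sum(1 for p in scores if p >= threshold)
    return confident / protein_length


if __name__ == "__main__":
    # Two toy models scoring the three peptide bonds of a 4-residue peptide.
    averaged = ensemble_average([[1.0, 0.25, 0.75], [0.75, 0.25, 1.0]])
    print(averaged, cleavage_site_density(averaged, protein_length=4))
```

Normalising by protein length allows long and short proteins to be compared on the same scale, mirroring the percentage-based UV/ROS composition measure.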
Using the MPSC webtool, we initially analysed skin ECM proteins for the predicted protease and UVR/ROS susceptibilities using MPSC-MPC models. This analysis suggests that tropoelastin (the elastin precursor) will be particularly susceptible to MMP-mediated cleavage, which agrees well with experimental observations (69). In addition, multiple proteins involved in elastic-fibre formation and organisation (FBLN1, FBLN2, FBLN5, EFEMP1, EFEMP2, MFAP4 and FBN2) were also predicted to be susceptible to not only MMP proteolysis but also to UVR/ROS. These predictions suggest that both protease-mediated and UVR/oxidative damage may play a role in the deterioration of the elastic fibre architecture which is characteristic of photodamage in human skin (70). We have previously shown an interplay between these mechanisms whereby UVR exposure enhances the degradation of fibrillin microfibrils (52).
In addition to elastic fibres, collagens also undergo profound remodelling in photo-exposed skin.
Protease-mediated activity has long been suspected as a driver of age-related degradation of skin collagens (44,45) but given the intermittent and/or chronic low-level action of these mechanisms, the causative link has yet to be established. Whilst the protease susceptibility of these major collagens has been confirmed experimentally (71), our computational techniques also suggest that many skin collagens (i.e. II, III, IV, V, VI, XII, and XV) will be degraded by proteases (MMP-1, -3, -9, Granzyme-B, and Cathepsin K) but not by UVR/ROS. For the ubiquitous microfibrillar collagen VI assemblies, protein abundance and architecture are resistant to photodamage (16). This in vivo resistance suggests that the collagen VI assembly is resistant to multiple degradative agents.
However, the alpha 5 chain of COL6 is degraded by Cathepsin K (72,73) and we have shown subtle UV-induced changes in its structure-associated features by mass spectrometry (15). We therefore conjecture that differential mechanisms may drive the degradation of elastic fibre components and skin collagens, necessitating different preventative and/or therapeutic approaches.

Discussion
Degradomics is an ever-expanding field with the potential to impact translational research by deepening our understanding of tissue regeneration processes as well as contributing to drug discovery efforts (9,77). In this work we show that analysis of AA sequence and extracted structural features, in combination with a state-of-the-art deep BRNN, can predict proteolytic cleavage sites more accurately than existing models.
Furthermore, we can use these predictions and sequence chromophore counts to identify not only known protein targets of photodamage but also potential novel protein targets and the relative susceptibilities within key protein families. With this work, we have stratified proteins within the entire human skin proteome to reveal their predicted susceptibilities to UV, ROS and proteolysis providing significant novel insights in skin research. Particularly, we have highlighted the fact that fibrillar collagens are predicted to be preferentially degraded by proteases alone whereas elastic fibres are likely to be susceptible to UVR/ROS and proteases in combination. We also demonstrate that proteins involved in catalytic processes have a high percentage of predicted confident cleavage sites per protein length. Moreover, we have made this analysis applicable to any AA sequence of interest.
Beyond the field of dermatology, the concordance between our computational analyses and previously reported observations of both in vitro and in vivo protein degradation suggests that this approach has the potential to identify novel protein biomarkers for tissues subjected to inflammation or ageing-related disease.
Whilst the MPC model performed better than existing models, its success in predicting experimental cleavage sites in native proteins still requires further improvement. Furthermore, while we have attempted initial experimental MMP9 cleavage site detection in a complex HDF proteome using LC-MS/MS approaches, it is clear that many cleavage sites may have been missed, particularly for low abundance proteins, where the peptide coverage of the protein is also low; therefore, experimental approaches and sample preparation methods also require further improvement. The difficulty in predicting cleavage sites in native proteins may be partly due to the heterogeneous nature of the data currently available in public databases such as MEROPS upon which the algorithms were built. For example, elastin degradation by proteases is very well defined, but well-known MMP substrates (such as FBN1) are not necessarily reflected in the available databases (27,69,78). In addition, data referenced in these databases are drawn from multiple experimental methods applied to disparate biological samples including peptide libraries, extracted cartilage proteins, post mortem brain tissue, and many others, which may not necessarily translate to different in vitro systems (27). Critically, we have shown using an LC-MS/MS peptide location fingerprinting approach that ECM proteins can exhibit tissue-dependent structures (79). Furthermore, higher order structures play crucial roles in determining proteolysis; however, the accurate prediction of protein secondary and tertiary features currently remains challenging (80). In this context, improved algorithms trained on expanded experimental datasets with additional informative features would benefit proteolytic cleavage site prediction.
Future development of these techniques will depend on the availability of training data encompassing key target proteomes and proteases and improved prediction by taking into consideration the impact of local protein structures on the protease-specific cleavage outcomes.
Overall, the MPSC webtool builds upon the success of the PROSPER and DeepCleave algorithms, which have contributed significantly to the degradomics field. It can further assist in novel biomarker discovery, reveal primary AA sequence susceptibilities to UV/ROS and proteases, and even assist in the discovery of novel bioactive matrikines (81).

Deep RNN protease model Evaluation Metrics
As the dataset is highly imbalanced containing thousands of non-cleavage sites vs only hundreds of cleavage sites, and thus in order to evaluate the performance of PROSPER, DeepCleave and MPC, we

Feature extraction
PROSPER revealed that not only the AA sequence context surrounding cleavage sites, but also protein secondary structure, disordered regions and solvent accessibility play important roles in proteolytic cleavage site prediction (30). The whole human proteome was annotated with PSIPRED (63), DISOPRED2 (64) and ACCPRO 5.2 (65,66) to retrieve these different types of sequence-derived structural features.

One-Hot encoding
Each AA was encoded using one-hot encoding, resulting in a 20-dimensional vector where each dimension represents one of the 20 common AAs. To capture information from the surrounding AAs, a sliding window of -4 and +4 AAs around the cleavage site was used for each sequence. At the N- and C-termini, where there are no preceding or following AAs, each position of the 20-dimensional vector contained only zeros. Furthermore, we complemented each of these vectors with three-dimensional coordinates retrieved from SCRATCH (x, y, z), a one-hot encoded representation of whether the amino acid was part of a coil, strand or helix, a two-dimensional one-hot vector encoding whether it was exposed or buried, and a two-dimensional one-hot vector encoding whether the AA was disordered or not.
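The sequence part of this encoding can be sketched as below. The sketch covers only the one-hot AA window with zero padding at the termini; the structural features (coordinates, secondary structure, solvent exposure and disorder) are omitted. The alphabet ordering and the function names are assumptions made for illustration.

```python
from typing import List

# Assumption: alphabetical one-letter ordering of the 20 common amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(aa: str) -> List[int]:
    """20-dimensional one-hot vector; padding or unknown residues stay all-zero."""
    vec = [0] * len(AMINO_ACIDS)
    if aa in AA_INDEX:
        vec[AA_INDEX[aa]] = 1
    return vec


def encode_window(sequence: str, site: int, flank: int = 4) -> List[List[int]]:
    """Encode the 2*flank residues around the peptide bond between
    sequence[site - 1] and sequence[site] (0-based), i.e. P4..P1 and P1'..P4'.
    Positions falling outside the sequence are encoded as zero vectors."""
    window = []
    for pos in range(site - flank, site + flank):
        aa = sequence[pos] if 0 <= pos < len(sequence) else ""
        window.append(one_hot(aa))
    return window


if __name__ == "__main__":
    encoded = encode_window("ACDEFGHIK", site=2)
    print(len(encoded), len(encoded[0]))  # window length and vector size
```

Each candidate site therefore yields an 8 x 20 matrix, to which the per-residue structural features would be appended column-wise before being passed to the recurrent layers.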

Train, test and validate data split
In contrast with previously published models, which split the data in train, test and validate datasets on the cleavage site basis regardless of the protein source, we have ensured that the testing, training and validating data all come from independent proteins. We used TensorFlow random seed to ensure the consistency of evaluations and ensured that the same protein identities were selected each time when evaluating the model performance; 70% of the proteins were used for the training, 15% for testing and 15% for validating.
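A protein-level split of this kind can be sketched as follows. The sketch uses Python's standard `random` module with a fixed seed rather than the TensorFlow seed mentioned above, purely to keep the example self-contained; the function name and the example accessions are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_by_protein(protein_ids: Sequence[str], seed: int = 0,
                     fractions: Tuple[float, float, float] = (0.70, 0.15, 0.15)
                     ) -> Dict[str, List[str]]:
    """Assign whole proteins (not individual cleavage sites) to the train,
    test and validate sets, so no protein contributes to more than one set."""
    ids = sorted(set(protein_ids))      # deterministic starting order
    random.Random(seed).shuffle(ids)    # reproducible shuffle for a given seed
    n_train = int(round(fractions[0] * len(ids)))
    n_test = int(round(fractions[1] * len(ids)))
    return {
        "train": ids[:n_train],
        "test": ids[n_train:n_train + n_test],
        "validate": ids[n_train + n_test:],
    }


if __name__ == "__main__":
    toy_ids = [f"P{i:05d}" for i in range(100)]  # made-up accessions
    split = split_by_protein(toy_ids, seed=42)
    print({name: len(members) for name, members in split.items()})
```

All windows derived from a protein then inherit that protein's assignment, which prevents near-identical sequence contexts from the same substrate appearing in both training and evaluation data.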

Architecture of the Deep RNN
We used the Python 3.8 TensorFlow Keras package to implement our MPC model. There are many more non-cleavage sites than cleavage sites, resulting in a very imbalanced dataset. To overcome this, we assigned larger weights to cleavage sites than to non-cleavage sites, forcing the classifier to "pay more attention" to the underrepresented class. We also utilised transfer learning to overcome the limited amount of data available for certain proteases (67). We ensured that the general model and protease-specific models contained the same identities for testing, training, and validating. MPC protease models consist of four bidirectional LSTM layers, a fully connected Dense layer and an output layer. We set epochs to a very large number (>10 000) and monitored the early

where Trp represents tryptophan, Tyr represents tyrosine, Met represents methionine, Cys represents cysteine, His represents histidine and Cys=Cys represents disulphide-bonded cystine.

Cell culture
Human dermal fibroblasts (HDFs) were cultured from a scalp biopsy, obtained from a hair transplant

HDF-deposited ECM in vitro
The procedure for obtaining HDF-deposited ECM was carried-out as previously described (83  Frederiksborg, Denmark), ensuring peptide stability until mass spectrometer was available.

DCN and VTN gel sample preparation
Excised bands were first placed into a perforated well plate, and then shrunk by washing in acetonitrile for five minutes. To remove the acetonitrile from the samples these were centrifuged for one minute at 1500 rpm. Sample gel pieces were then dried in a vacuum centrifuge for 15 minutes. Next, samples were covered completely by 10 mM dithiothreitol (DTT) in 25 mM ammonium bicarbonate and incubated at 56°C for one hour to reduce the proteins. Samples were cooled to room temperature, then centrifuged to remove the DTT. Iodoacetamide (55 mM) in 25 mM ammonium bicarbonate was added, and the samples were incubated for 45 minutes in the dark at room temperature. The

Peptide preparation for mass spectrometry
For LC-MS/MS analysis, 20 µl of Injection Solution (5% acetonitrile (ACN) + 0.1% formic acid (FA) in ultrapure water) was added to each dried peptide sample. Peptide concentration in solution was measured using a Direct Detect Infrared Spectrometer (Millipore). According to the concentration detected, each sample was diluted to give 12 µl of solution with a concentration of 800 ng/µl, and samples were submitted to the Biological Mass Spectrometry Core Facility (University of Manchester).

Liquid chromatography tandem mass spectrometry
Mass spectrometry was performed according to the Facility's protocols (84,85). Peptides were selected for fragmentation automatically by data-dependent analysis.

CP and CMG designed and performed MMP9 HDF experiments. CEMG and REBW contributed to the interpretation of results. All authors contributed to reviewing and editing of the paper.