FungiRegEx: A Tool for Pattern Identification in Fungal Proteomic Sequences Using Regular Expressions

Terrón-Macias, Victor; Mejia, Jezreel; Canseco-Pérez, Miguel Angel; Muñoz, Mirna; Terrón-Hernández, Miguel

doi:10.3390/app14114429


by Victor Terrón-Macias ¹, Jezreel Mejia ¹, Miguel Angel Canseco-Pérez ²,*, Mirna Muñoz ¹ and Miguel Terrón-Hernández ³

¹ Ingeniería de Software, Centro de Investigación en Matemáticas (CIMAT, A.C) Unidad Zacatecas, Zacatecas 98068, Mexico

² Ingeniería Agroindustrial, Universidad Politécnica de Chiapas, Suchiapa 29150, Mexico

³ Ingeniería en Mantenimiento Industrial, Universidad Tecnológica de Tlaxcala, Huamantla 90500, Mexico

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(11), 4429; https://doi.org/10.3390/app14114429

Submission received: 29 March 2024 / Revised: 15 May 2024 / Accepted: 17 May 2024 / Published: 23 May 2024

(This article belongs to the Special Issue Recent Advances in Bioinformatics: Novel Techniques, Methods, and Applications)


Abstract

In proteomic-scale research, it is imperative to automatically analyze numerous species and subspecies to discern distinctive characteristics shared by multiple species of the fungal kingdom whose sequences of interest could fulfill a specific biological function. To achieve this, complex sequences must be recognized within an organism’s entire proteome. Our study presents FungiRegEx, a software tool that facilitates the identification of regular expressions in the proteomes of fungal organisms and retrieves the data for the different species in real time from the JGI Mycocosm database without the need to download any file. Integrating a graphical user interface that makes it easy to use, the tool offers regular expression searches on 2402 fungal species from the JGI Mycocosm portal. The tool was validated with the AXSXG sequence and the RXLR effector, demonstrating the effectiveness of FungiRegEx in identifying user-defined patterns in the retrieved sequences. This tool allows customization and filtering, and it can save results if required, combining speed, adaptability, and ease of use. It requires neither a console nor programming, displaying the results in a GUI that makes them easier to read. Its architecture supports optimized use of resources and time, as well as implementation flexibility, allowing the customization of specific software parameters for resource management. The tool’s potential for future research and exploration is emphasized, providing a nuanced perspective on its practical use within the fungal genomics community. The tool is available at the addresses mentioned in the text.

Keywords:

proteome analysis; FungiRegEx; regular expressions finding; bioinformatics

1. Background

Understanding the characteristics of the species of the fungi phylogenetic tree requires the identification of specific sequences in the proteomes, which in turn may be correlated with the environment and its conditions [1]. The phylogenetic analysis provides a framework for research and identifying multiple similarities and conservation zones, which can be important for identifying certain protein functions or key structural regions. For example, searching for protein sequences containing a specific pattern can help to identify proteins that bind to certain ligands or have specific enzymatic activity [2]. Searching for repetitive patterns in protein sequences can also help to identify evolutionarily related proteins, which can provide information about the evolution of proteins and their functions over time [3]. However, a detailed analysis of proteomes is essential. Human experts can perform this task, but the analysis is challenging on a large scale.

In the biological field, existing algorithms are mostly hard-coded to traverse phylogenetic trees. Some software resources for regular expression searches, such as grep and msgfdb2pepxml, among others, are described below.

Grep is a text-processing program designed for regular expression pattern matching within text; it searches for regular expressions in a string, which can be a proteome or any other type of text sequence [4]. Grep is typically run on Linux or another Unix-like operating system and requires command-line terminal mastery and knowledge of regular expressions. It is important to note that grep operates solely within the terminal and lacks a Graphical User Interface (GUI).

Another resource for finding regular expressions is msgfdb2pepxml; this Python library converts the output from the MS-GFDB search engine to pepXML, uses regular expressions to recognize enzyme uses and cleavage rules, and supports PSI-MS [5]. Executing msgfdb2pepxml requires knowledge of the Python programming language and familiarity with library syntax (which also includes an understanding of regular expressions). Given its nature as a library, there is no GUI, reinforcing the importance of proficiency in understanding and manipulating the library through code.

PhyloPattern is another resource for finding regular expressions. It is a library focused on identifying regular expressions in phylogenetic trees; this library is not focused on proteomes or any other biological sequence [6]. To execute this tool, the user must have a Prolog engine and know the syntax of this programming language. Also, as this resource is a library, there is no GUI, underscoring the importance of manipulating the library through code.

PatScan is another resource, a program focused on searching for protein or nucleotide sequences of a pattern (regular expression) [7]. Executing this program involves compiling the source files and preparing a FASTA file for the search operation. Additionally, command-line terminal mastery for the compiling process and knowledge of regular expressions are essential.

PatMatch is another resource to identify repeats using regular expressions; this program does not automate searching for patterns within the sequences because it requires the user to write the complete sequence in which they want to search for the pattern; entering all the sequences of a genome or proteome can take a long time due to the size and number of elements [8]. For PatMatch utilization, access to the tool in the web browser is necessary. It should be noted that this tool focuses on peptide and nucleotide sequences, not on proteomic sequences; other relevant aspects of this tool are the length limitation of a search to less than 20 residues and the fact that it can only process one sequence at a time.

However, considering the tools’ requirements and characteristics, their implementation could be challenging, especially for users without proficiency in command-line terminal mastery and knowledge of programming.

None of the described tools and resources is dedicated to detecting regular expressions in fungal proteomes via web scraping; there is a noticeable absence of software tailored to this specific task. Also, some of them require a file in a particular format containing all the sequences, and the information in this file could be subject to errors if the source is not trustworthy.

Given the need to automate the analysis process at a large scale, there is now a need for software that can be easily integrated into this process, reads and analyzes proteomes on a large scale, detects matches, and saves considerable time without downloading files.

In this context, we present FungiRegEx v1.0, a software tool that addresses the identified limitations of the described tools. It takes the available information from the Joint Genome Institute (JGI) [9] Mycocosm portal (so that the information comes from a trustworthy source) and, through its integrated web scraper module, searches the proteome databases of multiple species with a user-defined regular expression. It also integrates a user-friendly Graphical User Interface (GUI); as such, the user does not need to install any additional components, download additional files, or have solid programming knowledge to use it.

FungiRegEx helps in the recognition of repeated sequences, which holds substantial significance because such sequences offer valuable insights into the functional and evolutionary roles of diverse organisms [10,11], driving evolution, inducing variation, and regulating gene expression [12]. Also, FungiRegEx is focused on pattern detection in fungi and can be customized to the resources of the computer or server where it is executed (in case the user wants a greater or lesser number of scraper instances). Finally, the tool can be deployed on either a server or a personal computer.

Notably, FungiRegEx stands out by providing a GUI, eliminating the need for additional files or programming knowledge, and performing the search of the regular expressions in multiple sequences simultaneously while saving resources and time.

2. Materials and Methods

This section presents the materials and methods used to develop FungiRegEx software.

2.1. Data Source

It is imperative to obtain proteomic information from reliable sources, meaning it has been validated, is recognized in the field with extensive coverage, and consistently updates its information. Therefore, we integrate the JGI Mycocosm database, Walnut Creek, CA, USA [13].

2.2. Architecture of FungiRegEx

The FungiRegEx front-end is based on React JS v17.0.2, a free and open-source JavaScript library designed for constructing user interfaces [14]. The back-end runs on Node JS v16.17, a JavaScript runtime built upon the V8 JavaScript engine [15], and uses Chromium [16], an open-source web browser. The application provides an interactive GUI, as shown in Figure 2 in Section 3. One characteristic of the GUI is that it allows the results table to be downloaded as a CSV file.

Internally, the application launches the scraper instances against the JGI Mycocosm database and optimizes memory consumption by reusing each launched instance once it has obtained its information; in case of an error, the instance is automatically restarted.
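
The reuse-and-restart behavior described above can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the FungiRegEx implementation (which runs on Node JS with Chromium): FakeScraper, worker, and scrape are hypothetical names, and the simulated failures stand in for real browser errors.

```python
import queue
import threading


class FakeScraper:
    """Hypothetical stand-in for a browser instance (no real page is loaded)."""

    def __init__(self, failing_ids):
        self.failing_ids = failing_ids  # shared set of ids that fail once

    def fetch(self, protein_id):
        if protein_id in self.failing_ids:
            self.failing_ids.discard(protein_id)  # fail only on the first attempt
            raise RuntimeError(f"simulated crash on id {protein_id}")
        return f"SEQUENCE_{protein_id}"


def worker(jobs, results, failing_ids, restarts):
    scraper = FakeScraper(failing_ids)  # launched once, then reused for every job
    while True:
        try:
            protein_id = jobs.get_nowait()
        except queue.Empty:
            return  # no work left for this instance
        try:
            results[protein_id] = scraper.fetch(protein_id)
        except RuntimeError:
            scraper = FakeScraper(failing_ids)  # restart the failed instance
            restarts.append(protein_id)
            results[protein_id] = scraper.fetch(protein_id)  # retry once


def scrape(ids, instances=3, failing_ids=()):
    jobs = queue.Queue()
    for protein_id in ids:
        jobs.put(protein_id)
    results, restarts = {}, []
    failing = set(failing_ids)
    threads = [
        threading.Thread(target=worker, args=(jobs, results, failing, restarts))
        for _ in range(instances)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, restarts


results, restarts = scrape(range(1, 11), instances=3, failing_ids={4, 9})
print(len(results), sorted(restarts))
```

In this sketch, the number of instances is a parameter, which mirrors the customization of scraper instances mentioned in the text; each instance handles many ids instead of being launched once per id.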

The FungiRegEx workflow for the user is as follows. First, the user selects the type of search and the species; note that newly added fungal species will not be available in the software unless the user adds them. Second, the user inputs the range over which to perform the search. Third, the search starts. Fourth, as mentioned before, the results can be filtered, ordered, and downloaded as an output file in CSV format.

2.3. Implementation

FungiRegEx is distributed as a ZIP file. The source code is available for download at https://sourceforge.net/p/fungiregex (accessed on 26 March 2024) and https://github.com/maigolinox/fungiregex (accessed on 26 March 2024). Once FungiRegEx has been downloaded and unzipped, the user must read the documentation, which contains detailed instructions on installing it locally or on a server if required. Briefly, to run FungiRegEx, only two commands are needed in separate instances of bash: the first one to execute the front end of FungiRegEx is npm run start:frontend and the last one to execute the back end of FungiRegEx is npm run start:backend.

The regular expression search implemented in FungiRegEx is based on finding exact repeats of length k along the proteomic sequence. The regular expression can be whatever length the user requires. The shorter the protein sequence of length k, the faster the comparison proceeds. Once the regular expression has been searched for in the protein sequences of the fungal organism, the results are filtered to eliminate the sequences that do not match.

2.3.1. Searching for Regular Expressions Matches

The application back-end begins its search by creating a regular expression object.

Regular expressions are patterns used to search for character combinations in text strings. Regular expressions can contain various special characters and modifiers that define the pattern to search for [17].
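
As a minimal illustration of such a regular expression object, the following Python sketch compiles the pattern A.S.G and applies it to a short string. Python is used here only for illustration (the FungiRegEx back-end runs on Node JS), and the sequence is a made-up example, not data from JGI Mycocosm.

```python
import re

# Compile the pattern once so that it can be reused across many sequences.
# "." matches any single character, so A.S.G stands for the AXSXG motif.
pattern = re.compile(r"A.S.G")

# Hypothetical toy sequence, not taken from JGI Mycocosm.
sequence = "MKTAHSMGLLAVSAGQ"

match = pattern.search(sequence)  # first occurrence, or None if absent
first_match = match.group(0) if match else None
print(first_match)
```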

The magnitude of the search range directly impacts the algorithm’s processing time; smaller ranges are preferred for optimal efficiency.

2.3.2. If Matches between Regular Expression and Protein Sequence Are Found

The search process identifies the regular expression within the amino acid chain with a length of k, as illustrated in Figure 1, where we can assume a chain of amino acids that, in accordance with the functionality of FungiRegEx, is capable of finding the matches of the regular expression throughout the chain, in addition to counting them. As the detected repeats do not accurately reflect the true length of the repeating pattern, they need to be expanded to match the actual repeat length. This paper’s chosen approach involves reading the characters to the left or right of all repeats and storing the matches in an array.
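
The idea of locating every match in a chain, counting the matches, and storing them in an array can be sketched as follows. This is an illustrative Python sketch rather than the tool's own code; find_matches is a hypothetical helper, the sequence is invented, and Python's re.finditer reports non-overlapping matches only.

```python
import re


def find_matches(pattern: str, sequence: str) -> list:
    """Return every non-overlapping match with its start and end position."""
    compiled = re.compile(pattern)
    return [
        {"match": m.group(0), "start": m.start(), "end": m.end()}
        for m in compiled.finditer(sequence)
    ]


# Hypothetical toy sequence, not taken from JGI Mycocosm.
sequence = "MKTAHSMGLLAVSAGQ"
matches = find_matches(r"A.S.G", sequence)
match_count = len(matches)  # analogous to a per-sequence match counter

for entry in matches:
    print(entry["match"], entry["start"], entry["end"])
```

Because the positions are kept alongside each match, the characters to the left or right of a match can be read afterwards by slicing the sequence around the stored start and end indices.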

2.3.3. If No Matches between a Regular Expression and Protein Sequence Are Found

If no matches are found, the algorithm will continue the search in another chain. The search algorithm can progress in larger intervals without overlooking any repetitions. The crucial factor in progressing with larger intervals is ensuring the search algorithm never overlooks matches.

2.3.4. Processing Speed

The time the algorithm takes to process the regular expression can be approximated with the following formula, which yields the time in seconds:

\mathrm{time} = \frac{\mathrm{number\ of\ ids}}{\mathrm{number\ of\ pages\ per\ second}}

(1)

Reducing the size of search intervals can improve processing speed. However, the algorithm’s speed may decrease if the intervals become too small. This is because smaller intervals can cause the program to spend more time launching browser instances than acquiring information and performing the search for the regular expression.
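
The approximation in Equation (1) can be written as a small helper. The sketch below is illustrative only: estimated_seconds is a hypothetical name, and the range of 2000 ids and the rate of 7 pages per second are example inputs, since the actual rate depends on the computer and the connection.

```python
def estimated_seconds(number_of_ids: int, pages_per_second: float) -> float:
    """Approximate search time: ids to consult divided by pages consulted per second."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return number_of_ids / pages_per_second


# Example inputs only; the real rate varies with hardware and network.
seconds = estimated_seconds(2000, 7.0)
print(f"{seconds:.1f} s (about {seconds / 60:.1f} min)")
```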

The next section presents the results obtained using FungiRegEx.

3. Results

As part of the results, we developed a GUI for using FungiRegEx, shown in Figure 2. This GUI is described in Section 2.2.

FungiRegEx was validated and tested by searching for the sequence AXSXG as a regular expression. AXSXG is a pentapeptide found in a group of lipases that confers thermostability and solvent resistance on the enzyme [18] and has been little described in fungi. The considerations for validating and testing FungiRegEx with this sequence are as follows:

1. Species: Saccharomyces cerevisiae (SacceM3836_1) [19,20,21,22].
2. AXSXG pentapeptide, where X represents any amino acid.

To obtain results, we performed the search with the following parameters:

3. Regular expression: A.S.G, where “.” in regular expression syntax means any character.
4. Search in a specific range: 1 to 2000. This means that FungiRegEx will launch the scraper instances to retrieve the data for 2000 sequences from the JGI Mycocosm database.

After running the search, the tool showed that, of the 2000 scraped sequences, only the one with identifier 1434 contains the AHSMG pentapeptide, and it contains it only once. The tool also identified matches to the regular expression in 281 sequences.

Figure 3 shows the data retrieved directly from JGI Mycocosm, where element 1 is the browser’s search tool looking for the string AHSMG, element 2 is the header of the sequence (species SacceM3836_1, id 1434), and element 3 is the AHSMG match in the sequence. Figure 4 shows the same match in the FungiRegEx tool when searching for AHSMG; element 1 shows the search parameters, element 2 is the results table with the regular expression, and element 3 is the sequence identifier. This shows that the result obtained directly from the JGI Mycocosm data and the result obtained with FungiRegEx are the same.

As mentioned, the tool also searches the complete sequence for other similarities according to the regular expression that the user inputs. This can be seen in Figure 5, where we hide the proteome column (Element 1) for the image size to show the results of FungiRegEx with the mentioned parameters.

In this way, the tool can find the regular expression in the user-determined search proteome, which may interest subsequent studies.
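
The combination of a general pattern search followed by a filter on one specific match can be sketched as follows. This Python sketch is only an illustration under stated assumptions: search_sequences is a hypothetical helper, and the three sequences and their identifiers are invented rather than retrieved from JGI Mycocosm.

```python
import re


def search_sequences(pattern: str, sequences: dict) -> list:
    """Build one result row per sequence that contains at least one match."""
    compiled = re.compile(pattern)
    rows = []
    for seq_id, sequence in sequences.items():
        found = compiled.findall(sequence)
        if found:  # sequences without matches are skipped
            rows.append({"id": seq_id, "matches": found, "count": len(found)})
    return rows


# Invented identifiers and sequences, for illustration only.
sequences = {
    101: "MKTAHSMGLLAVSAGQ",
    102: "MPLLKKTTWQ",
    103: "GGASSLGAA",
}

rows = search_sequences(r"A.S.G", sequences)

# Narrow the general A.S.G results down to one specific pentapeptide.
filtered = [row for row in rows if "AHSMG" in row["matches"]]
print([row["id"] for row in rows], [row["id"] for row in filtered])
```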

A second use case executed to validate FungiRegEx functionality involves the search for effectors. Liping Liu et al. [23] identified different effectors, such as the RXLR, asserting that fungi, oomycetes, and bacteria release small, secreted proteins crucial for symbiotic interaction and pathogenicity. Liping Liu investigated various effectors in different species, such as Mg3LysM (Mycosphaerella graminicola LysM), secreted by Mycosphaerella graminicola [24,25]. For this example, the JGI Mycocosm database of Mycosphaerella graminicola v2.0 will be used with the RXLR sequence, where X represents any amino acid. In regular expression language, the regular expression is R.LR.
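
The translation from a motif written with X as a wildcard to a regular expression can be sketched as follows. This is an illustrative Python sketch, not part of FungiRegEx; motif_to_regex is a hypothetical helper and the test sequence is invented.

```python
import re


def motif_to_regex(motif: str, wildcard: str = "X") -> str:
    """Translate a motif such as RXLR into a regular expression such as R.LR."""
    return "".join(
        "." if residue == wildcard else re.escape(residue)
        for residue in motif.upper()
    )


rxlr = motif_to_regex("RXLR")
axsxg = motif_to_regex("AXSXG")

# Invented sequence used only to exercise the pattern.
hits = re.findall(rxlr, "MRSLRAAKRELRQ")
print(rxlr, axsxg, hits)
```

Note that “.” matches any character except a line break, not only amino acid letters; a stricter variant could replace it with a character class listing the amino acid codes.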

Figure 6 shows the results of FungiRegEx, hiding the proteome column due to the size of the proteomes to show the results in the figure.

Figure 6 shows that FungiRegEx can find the effector of interest in the proteome. In addition, it identifies that Mycosphaerella graminicola has the RXLR effector [24]. Also relevant to the results is that a counter is incremented each time the tool finds the regular expression in the sequence, and its value is displayed in the “# MATCHES” column.

FungiRegEx employs a straightforward sequential search method to identify regular expressions directly in the protein sequences of fungal organisms. Diverging from the prevalent approach of employing a suffix tree or alignment matrix as a primary data structure, the algorithm introduced in this paper operates by directly identifying regular expressions within the protein sequence. As a result, this methodology is efficient in memory usage because of how the scraper instances are launched and run, is easy to understand and implement, and retrieves multiple sequences at the same time much faster than if the process were carried out manually or with tools like PatMatch [8], which requires the user to enter sequences one by one. Also, FungiRegEx does not require downloading any FASTA or other file.

With 200 Puppeteer instances, 50,000 results are obtained in approximately 139 min, as can be seen in Figure 7, which shows the estimated time indicated by the monitor. Depending on its resources, the computer can consult at least seven pages per second; this means that the 50,000 results can be consulted in only 12 min. These figures depend on the available computer resources.

It should be mentioned that just requesting the JGI Mycocosm database and getting a response on the webpage takes around 6.64 s, as shown in Figure 8.

FungiRegEx demonstrated efficiency in proteomic sequence analysis and could streamline the process of analyzing proteomic sequences of the available species in the JGI Mycocosm portal. It eliminates the need to download FASTA files by dynamically retrieving data from the JGI Mycocosm database, ensuring real-time access. This significantly speeds up result retrieval compared to manual methods, optimizing time and computational resource usage. It offers the adaptability to run on any computer and allows customization of scraper instances. The user-friendly GUI facilitates regular expression searches because it requires no coding knowledge, and it presents results in a customizable table, enhancing accessibility. Moreover, it operates independently of specific operating systems and offers deployment options for local or server use. Additionally, it facilitates efficient result filtering and the identification of specific sequences. The console-free user experience further enhances accessibility, while the simplified search syntax with user-defined parameters aids in targeted searches. The tool’s potential for future research is notable, as it identifies user-defined regular expressions on certain proteomes, paving the way for further exploration. Finally, it provides detailed information about matching sequences, including the exact match count.

FungiRegEx enhances the efficiency of proteomic sequence analysis and provides a user-friendly, adaptable, and customizable tool with features for interpreting and exploring results without downloading any file to perform its functions. The next section discusses the obtained results.

4. Discussion

In this section, we discuss the findings and limitations associated with FungiRegEx. Table 1 compares FungiRegEx’s features with those of the other tools. FungiRegEx stands out for its wide coverage, while the other tools cover a maximum of four features.

Below, we describe how those features are covered, highlighting the tool’s efficiency in data retrieval, which allows for accelerated retrieval of results and optimized use of computational resources because each instance is reused instead of being launched again, in contrast to other tools that require the data set to be archived locally to perform the search. FungiRegEx does not require downloading the file since it consults the information directly in the database. To extend the tool beyond the registered species, users must manually enter the characteristics of newly added fungal species, which could limit its future applicability.

FungiRegEx introduces user-friendly features such as real-time data retrieval, customization (it allows the user to configure the parameters of the instances that can be launched, depending on the computational resources available), and a Graphical User Interface (see Figure 2), presenting a solution for researchers who do not have the programming knowledge or bash mastery that the tools described in Section 1 require.

Removing FASTA file downloads through the FungiRegEx scraper module advances real-time data retrieval. Still, the dependency on Internet connectivity affects the search speed, which varies depending on the quality of the user’s connection. Despite this limitation, the tool significantly speeds up result retrieval compared to manual extraction, which could benefit large-scale studies considerably. Factors such as internet speed and the availability of computational resources must be balanced to optimize performance and to avoid issues such as temporary IP blocking when too many scraper instances are deployed. For this reason, it is recommended not to exceed 100 instances in parallel.

The efficient use of computational resources underlines the good functioning of FungiRegEx, coupled with the fact that users can configure the parameters for greater control of computational resources. Its adaptability to any computer and customizable parameters improve versatility despite limitations such as single-task support and possible task deletion in simultaneous use scenarios between different users.

The GUI simplifies the presentation of results, but readability problems arise in very long proteomes as the table where the results are displayed becomes very large, suggesting room for improvement in the interface.

Computer OS independence and deployment flexibility allow users to align runtime modes with their preferences and infrastructure, with a recommendation for local installation due to its single-user nature.

Regarding the last six covered features, FungiRegEx does not present any restriction regarding the length limit of the sequence to perform the regular expression search. Also, as mentioned before, this tool does not require the user to have programming knowledge or bash mastery to use it. Another relevant aspect is that this tool does not require a FASTA file; FungiRegEx can perform searches into multiple sequences at the same time. Finally, FungiRegEx allows users to store the results in CSV format if required.

The tool’s user-defined search parameters improve applicability, although users must specify proteome lengths to perform accurate searches.

The ability to filter user-defined regular expressions in results facilitates ongoing and future research by allowing the exploration of specific sequences within the returned results, provided that users already possess the regular expressions of interest they wish to search for.

Finally, it is important to note that the results can be stored in CSV format. This allows users to export the results and open the file in a spreadsheet or text-processing program such as Microsoft Excel or Word, among others.
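
As an illustration of what such a CSV export amounts to, the following Python sketch serializes result rows with the standard csv module. It is not the tool's own export code; rows_to_csv is a hypothetical helper and the column names are assumptions chosen for the example.

```python
import csv
import io


def rows_to_csv(rows: list) -> str:
    """Serialize result rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "regex", "matches"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Example row with assumed column names; values follow the validation case.
rows = [{"id": 1434, "regex": "A.S.G", "matches": 1}]
csv_text = rows_to_csv(rows)
print(csv_text)
```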

5. Conclusions

FungiRegEx represents a significant advancement in proteomic sequence analysis, offering researchers a streamlined and user-friendly approach. Handling datasets without downloading FASTA files accelerates research processes and facilitates broader investigations into fungal proteomes. Its real-time data retrieval capability from the JGI Mycocosm website enhances accessibility, although it depends on internet speed. FungiRegEx effectively utilizes computational resources and offers customization options, making it adaptable to various research needs and computational environments.

Despite the limitations mentioned in Section 4, FungiRegEx demonstrates clear advantages in efficiency, speed, and ease of use compared to manual methods, thus representing a valuable tool for proteomic research. Addressing these limitations could further enhance its utility and impact on the scientific community.

Author Contributions

Conceptualization, J.M., M.A.C.-P. and M.T.-H.; Methodology, M.A.C.-P. and M.T.-H.; Software, V.T.-M., M.A.C.-P. and M.M.; Validation, M.M. and M.T.-H.; Formal analysis, V.T.-M., M.A.C.-P. and M.T.-H.; Investigation, J.M., M.A.C.-P. and M.M.; Resources, V.T.-M. and M.A.C.-P.; Writing—original draft, V.T.-M. and M.A.C.-P.; Writing—review & editing, V.T.-M., J.M., M.A.C.-P. and M.M.; Project administration, M.A.C.-P.; Funding acquisition, M.A.C.-P. All authors have read and agreed to the published version of the manuscript.

Funding

We received financial support from the Council of Science Technology and Innovation of Zacatecas state (COZCyT). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability Statement

The data presented in this study are available in JGI Mycocosm Database at https://mycocosm.jgi.doe.gov/Mycgr3/Mycgr3.home.html (accessed on 16 May 2024) for Mycosphaerella graminicola, reference number [26] and https://mycocosm.jgi.doe.gov/SacceM3836_1/SacceM3836_1.home.html (accessed on 16 May 2024) for Saccharomyces cerevisiae, reference number [17,18,19,20]. These data were derived from the following resources available in the public domain: https://mycocosm.jgi.doe.gov/mycocosm/home (accessed on 16 May 2024).

Conflicts of Interest

The authors declare no conflict of interest.

References

Muggia, L.; Ametrano, C.G.; Sterflinger, K.; Tesei, D. An Overview of Genomics, Phylogenomics and Proteomics Approaches in Ascomycota. Life 2020, 10, 356. [Google Scholar] [CrossRef] [PubMed]
Roche, D.B.; Brackenridge, D.A.; McGuffin, L.J. Proteins and Their Interacting Partners: An Introduction to Protein–Ligand Binding Site Prediction Methods. Int. J. Mol. Sci. 2015, 16, 29829–29842. [Google Scholar] [CrossRef] [PubMed]
Merski, M.; Młynarczyk, K.; Ludwiczak, J.; Skrzeczkowski, J.; Dunin-Horkawicz, S.; Górna, M.W. Self-analysis of repeat proteins reveals evolutionarily conserved patterns. BMC Bioinform. 2020, 21, 179. [Google Scholar] [CrossRef]
Bull, R.I.; Trevors, A.; Malton, A.J.; Godfrey, M.W. Semantic grep: Regular expressions + relational abstraction. In Proceedings of the Ninth Working Conference on Reverse Engineering, Richmond, VA, USA, 29 October–1 November 2002; pp. 267–276. [Google Scholar]
Nagaev, B.; Yashina, K.; Palmblad, M. msgfdb2pepxml (Version 2.0) [Python Script]. 2011. Available online: https://ms-utils.org/msgfdb2pepxml/ (accessed on 2 March 2022).
Gouret, P.; Thompson, J.D.; Pontarotti, P. PhyloPattern: Regular expressions to identify complex patterns in phylogenetic trees. BMC Bioinform. 2009, 10, 298. [Google Scholar] [CrossRef] [PubMed]
Dsouza, M.; Larsen, N.; Overbeek, R. Searching for patterns in genomic data. Trends Genet. 1997, 13, 497–498. [Google Scholar] [CrossRef] [PubMed]
Yan, T.; Yoo, D.; Berardini, T.Z.; Mueller, L.A.; Weems, D.C.; Weng, S.; Cherry, J.M.; Rhee, S.Y. PatMatch: A program for finding patterns in peptide and nucleotide sequences. Nucleic Acids Res. 2005, 13, W262–W266. [Google Scholar] [CrossRef] [PubMed]
Joint Genome Institute (JGI). About Us. Joint Genome Institute. 2022. Available online: https://jgi.doe.gov/about-us/ (accessed on 5 March 2022).
Achaz, G.; Coissac, E.; Netter, P.; Rocha, E.P. Associations between inverted repeats and the structural evolution of bacterial genomes. Genetics 2003, 164, 1279–1289. [Google Scholar] [CrossRef] [PubMed]
Van Belkum, A.; Scherer, S.; Van Alphen, L.; Verbrugh, H. Short-sequence DNA repeats in prokaryotic genomes. Microbiol. Mol. Biol. Rev. 1998, 62, 275–293. [Google Scholar] [CrossRef]
Liao, X.; Zhu, W.; Zhou, J.; Li, H.; Xu, X.; Zhang, B.; Gao, X. Repetitive DNA sequence detection and its role in the human genome. Commun. Biol. 2023, 6, 954. [Google Scholar] [CrossRef] [PubMed]
Nordberg, H.; Cantor, M.; Dusheyko, S.; Hua, S.; Poliakov, A.; Shabalov, I.; Smirnova, T.; Grigoriev, I.V.; Dubchak, I. The genome portal of the Department of Energy Joint Genome Institute: 2014 updates. Nucleic Acid Res. 2014, 42, D26–D31. [Google Scholar] [CrossRef] [PubMed]
Meta Platforms, Facebook Open Source, Getting Started, What Is React and Documentation. 2020. Available online: https://reactjs.org/docs/getting-started.html (accessed on 5 March 2022).
OpenJS Foundation. Getting Started, What Is Node JS and Documentation. OpenJS Foundation. 2020. Available online: https://nodejs.org/en/docs/ (accessed on 5 March 2022).
Google LLC. Getting Started, What Is Chromium and Documentation. 2020. Available online: https://www.chromium.org/Home/ (accessed on 2 February 2022).
MDN Web Docs. Regular Expressions. January 2022. Available online: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions (accessed on 3 February 2022).
Gutiérrez-Domínguez, D.E.; Chí-Manzanero, B.; Rodríguez-Argüello, M.M.; Todd, J.N.A.; Islas-Flores, I.; Canseco-Pérez, M.Á.; Canto-Canché, B. Identification of a Novel Lipase with AHSMG Pentapeptide in Hypocreales and Glomerellales Filamentous Fungi. Int. J. Mol. Sci. 2022, 23, 9367.
Engel, S.R.; Wong, E.D.; Nash, R.S.; Aleksander, S.; Alexander, M.; Douglass, E.; Karra, K.; Miyasato, S.R.; Simison, M.; Skrzypek, M.S.; et al. New data and collaborations at the Saccharomyces Genome Database: Updated reference genome, alleles, and the Alliance of Genome Resources. Genetics 2022, 220, iyab224.
Altschul, S.F.; Madden, T.L.; Schäffer, A.A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.
Schäffer, A.A.; Aravind, L.; Madden, T.L.; Shavirin, S.; Spouge, J.L.; Wolf, Y.I.; Koonin, E.V.; Altschul, S.F. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001, 29, 2994–3005.
Wong, E.D.; Miyasato, S.R.; Aleksander, S.; Karra, K.; Nash, R.S.; Skrzypek, M.S.; Weng, S.; Engel, S.R.; Cherry, J.M. Saccharomyces genome database update: Server architecture, pan-genome nomenclature, and external resources. Genetics 2023, 224, iyac191.
Liu, L.; Xu, L.; Jia, Q.; Pan, R.; Oelmüller, R.; Zhang, W.; Wu, C. Arms race: Diverse effector proteins with conserved motifs. Plant Signal. Behav. 2019, 14, 1557008.
Marshall, R.; Kombrink, A.; Motteram, J.; Loza-Reyes, E.; Lucas, J.; Hammond-Kosack, K.E.; Thomma, B.P.; Rudd, J.J. Analysis of Two in Planta Expressed LysM Effector Homologs from the Fungus Mycosphaerella graminicola Reveals Novel Functional Properties and Varying Contributions to Virulence on Wheat. Plant Physiol. 2011, 156, 756–769.
Lee, W.S.; Rudd, J.J.; Hammond-Kosack, K.E.; Kanyuka, K. Mycosphaerella graminicola LysM Effector-Mediated Stealth Pathogenesis Subverts Recognition Through Both CERK1 and CEBiP Homologues in Wheat. Mol. Plant-Microbe Interact. 2014, 27, 236–243.
Brown, S.D.; Klingeman, D.M.; Johnson, C.M.; Clum, A.; Aerts, A.; Salamov, A.; Sharma, A.; Zane, M.; Barry, K.; Grigoriev, I.V.; et al. Genome Sequences of Industrially Relevant Saccharomyces cerevisiae Strain M3707, Isolated from a Sample of Distillers Yeast and Four Haploid Derivatives. Genome Announc. 2013, 1, 10–1128.

Figure 1. An occurrence of a match: matches in the amino acid chain are detected regardless of where the regular expression pattern occurs in the chain.

Figure 2. GUI of FungiRegEx. Element 1 and Element 2 allow the user to choose the type of search; Element 3 is a selector of the species; Element 4 is a progress indicator; Element 5 is the Regular Expression input (where * means any character); Element 6 is the range input; Element 7 is the table of results.

Figure 3. Retrieved protein sequence from Saccharomyces cerevisiae M3836 v1.0 with ID 1434 [21]. Element 1 contains the string search input, Element 2 is the sequence information, and Element 3 is the sequence match.

Figure 4. Results from FungiRegEx using the A.S.G regular expression, filtered by the specific expression AHSMG. Element 1 shows the search parameters (type of search, species, range, and regular expression), Element 2 is the regular expression filter for the results table, and Element 3 is the matching sequence.
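
The search-then-filter behavior described in this caption can be illustrated with a minimal sketch. It is not the FungiRegEx implementation: the toy sequences, the protein identifiers, and the helper name `find_matches` are hypothetical, and Python's `re` module is used here only for illustration. The pattern A.S.G is searched at every position of each sequence, and the hits are then narrowed to the specific pentapeptide AHSMG.

```python
import re

# Hypothetical toy sequences for illustration; not real JGI Mycocosm records.
sequences = {
    "protein_1": "MKTAHSMGLLVAASAGQ",
    "protein_2": "MAVSNGGLLPRT",
    "protein_3": "AGSWGKKLAHSMG",
}

# '.' matches any single residue, so A.S.G covers AHSMG, AASAG, AVSNG, ...
pattern = re.compile(r"A.S.G")


def find_matches(seqs, regex):
    """Return (protein id, 0-based start, matched text) for each hit.

    finditer scans the whole chain, so a hit is reported wherever the
    pattern occurs; overlapping hits are not reported.
    """
    hits = []
    for protein_id, seq in seqs.items():
        for m in regex.finditer(seq):
            hits.append((protein_id, m.start(), m.group()))
    return hits


# Step 1: broad search with the regular expression.
all_hits = find_matches(sequences, pattern)

# Step 2: filter the results by the specific expression AHSMG.
ahsmg_hits = [hit for hit in all_hits if hit[2] == "AHSMG"]

for protein_id, start, text in ahsmg_hits:
    print(f"{protein_id}: {text} at position {start}")
```

In this sketch the broad pattern returns every pentapeptide of the form A.S.G, while the filter keeps only the rows whose matched text is exactly AHSMG, mirroring the two-stage view in the results table.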

Figure 5. Results from FungiRegEx with customized columns and results filtered using the A.S.G regular expression. Element 1 shows that the columns of the results table can be enabled or disabled for the user's convenience.

Figure 6. Results from FungiRegEx with the proteome column hidden, using the R.LR effector regular expression.

Figure 7. Progress monitor and calculation of processing time. The puppeteer cluster includes a tool that monitors the progress of data acquisition and the performance of each instance.

Figure 8. Network tool in the web browser: a single request to the JGI Mycocosm database server takes 6.64 s (the green line indicates the server response time in seconds); issuing 50,000 such requests manually, one by one, would take approximately 92 h.
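
The 92 h estimate in this caption follows from simple arithmetic, which the short sketch below reproduces. The constant names are illustrative, and the calculation assumes that every request takes the same 6.64 s and that requests are issued strictly one after another.

```python
# Figures taken from the caption; names are illustrative only.
SECONDS_PER_REQUEST = 6.64   # observed response time for one request
N_REQUESTS = 50_000          # number of sequential requests considered

# Sequential requests: total time is the per-request time times the count.
total_seconds = SECONDS_PER_REQUEST * N_REQUESTS
total_hours = total_seconds / 3600  # 3600 seconds per hour

print(f"{total_seconds:.0f} s is about {total_hours:.1f} h")
```

That is, 50,000 × 6.64 s = 332,000 s, which is roughly 92 h when divided by 3600 s per hour.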

Table 1. Feature comparison of existing tools vs. FungiRegEx, where ✕ means not covered and ✔ means covered.

Tool	Optimal Resource Management?	Has a GUI?	Free of Computer OS Implementation?	Allows Full Sequence Length?	No Need to Have Programming or Syntax Knowledge?	No Need to Have Bash Mastery?	Can It Function without Downloading a FASTA File?	Can It Process Multiple Sequences at the Same Time?	Is It Easy to Store Results?
grep	✔	✕	✕	✔	✕	✕	✕	✕	✕
msgfdb2pepxml	✔	✕	✔	✔	✕	✔	✕	✕	✕
PhyloPattern	✔	✕	✔	✔	✕	✔	✕	✕	✕
PatScan	✔	✕	✔	✔	✕	✔	✕	✕	✕
PatMatch	✔	✕	✔	✕	✕	✔	✕	✕	✕
FungiRegEx	✔	✔	✔	✔	✔	✔	✔	✔	✔

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Terrón-Macias, V.; Mejia, J.; Canseco-Pérez, M.A.; Muñoz, M.; Terrón-Hernández, M. FungiRegEx: A Tool for Pattern Identification in Fungal Proteomic Sequences Using Regular Expressions. Appl. Sci. 2024, 14, 4429. https://doi.org/10.3390/app14114429

