Article

Evolutionarily Distinct Enzymes Uncovered Through Sequence Similarity Network Analysis of De Novo Transcriptomes from Underexplored Protist Axenic Cultures

by Manabu W. L. Tanimura 1,2,*, Motoki Kayama 2 and Kazumi Matsuoka 2,3
1 Graduate School of Human and Environmental Studies, Kyoto University, Kyoto 606-8316, Japan
2 R&D, Seed Bank Co., Ltd., Kyoto 606-8267, Japan
3 C/O Institute for East China Sea Research, Nagasaki University, Nagasaki 852-8521, Japan
* Author to whom correspondence should be addressed.
Fermentation 2026, 12(2), 71; https://doi.org/10.3390/fermentation12020071
Submission received: 11 December 2025 / Revised: 19 January 2026 / Accepted: 20 January 2026 / Published: 27 January 2026

Abstract

Protists represent a vast yet underexplored reservoir of enzymatic diversity across the eukaryotic tree of life. In this study, we established axenic strains of diverse protists from four major eukaryotic supergroups using single-cell isolation and generated de novo transcriptomes for each strain, as reference genomes or transcriptomes are not available for these strains. As a test case for industrial enzyme discovery, we targeted nine enzyme classes used in pulp processing and evaluated whether protist-derived sequences occupy underrepresented sequence space relative to major public databases. Functional annotation combined with Sequence Similarity Network analysis revealed multiple clusters composed exclusively of protist-origin sequences, indicating candidate enzymes with high sequence-level novelty. These results suggest that protists may provide a practical resource for expanding the repertoire of industrially relevant enzymes and prioritizing targets for further characterization. However, additional in silico analyses and experimental validation will be required to determine whether these sequence-divergent candidates exhibit properties that meet industrial requirements.

1. Introduction

Rapid economic growth and industrialization have resulted in the large-scale consumption of fossil fuels, leading to a steady depletion of underground reserves. Therefore, renewable and sustainable alternative energy sources are urgently required. Cellulose is the most abundant biomass resource on Earth and is mainly produced by terrestrial plants. As a renewable feedstock, cellulose represents an attractive candidate for bioenergy production [1]. In contrast to first-generation biofuel feedstocks such as corn and sugarcane, lignocellulosic biomass—including wood waste, agricultural residues, sugarcane bagasse, and biorefinery by-products [2]—can be hydrolyzed into glucose and subsequently fermented into a variety of biofuels such as ethanol, methanol, butanol, hydrocarbons [3], biodiesel [4], biomethane [5], and biohydrogen [6]. However, lignocellulose degradation is the rate-limiting step, and efficient cellulose hydrolysis is the key to improving conversion yields. Accordingly, cellulose-degrading enzymes such as cellulases [7]; hemicellulose-degrading enzymes, including xylanases [8]; and pectin-degrading enzymes, such as pectinases [9] have long been essential tools in the biorefinery industry.
Meanwhile, the pulp and paper industry has increasingly relied on lignocellulosic feedstocks, including recycled paper, due to global forest decline. In addition to hydrolytic enzymes, bleaching/deinking [10], pitch control (removal of sticky precipitates consisting of triglycerides, fatty acids, and resins) [11], and other enzymatic treatments are also crucial for efficient paper recycling.
Not every enzyme is effective on every substrate or under all processing conditions. Industrial applications often demand enzymes that remain active under harsh physicochemical conditions—high or low temperatures, acidic or alkaline pH, low water activity, high salinity, toxic metal ions, exposure to chemical denaturants, or high substrate concentrations [12,13,14]. Enzymes capable of functioning under such stresses are of great commercial interest.
To date, two primary strategies have been adopted to obtain such enzymes. The first is bioprospecting of extremophiles, i.e., microorganisms thriving in hot springs, highly acidic or alkaline environments, and similar habitats. For example, thermostable DNA polymerases used in PCR were discovered from thermophilic bacteria [15]. Similarly, many currently industrialized hydrolytic enzymes were originally identified from extremophiles. The second approach involves protein engineering to generate improved or chimeric enzymes not found in nature, analogous to engineered GFP variants that produce diverse fluorescent colors [16]. Many industrial cellulases [17], lipases [18], and esterases [19] have been optimized through recombinant engineering.
However, both strategies have limitations. Extremophile-derived enzymes have predominantly been explored only among certain bacterial and fungal lineages. Our previous work suggested that numerous invertebrate animals encode endogenous cellulases, indicating that completely unexplored cellulase families may exist and exhibit novel catalytic properties [20,21]. Moreover, protists remain vastly underexplored due to challenges in their isolation and cultivation [22]. In UniProtKB/Swiss-Prot, curated protist enzyme sequences are scarce, despite protists representing the overwhelming majority of eukaryotic supergroup diversity [23,24]. Thus, the probability of discovering completely novel cellulose-degrading enzymes—and other industrially relevant enzymes—from protists is exceptionally high.
In this study, we targeted protist strains because major public protein resources (e.g., UniProtKB/Swiss-Prot) contain relatively limited protist-derived sequences and because protist species richness is widely considered to be substantially underestimated [25,26]. We selected axenic protist strains isolated in our laboratory that collectively span diverse eukaryotic supergroups, and focused on industrially relevant enzymes, particularly those associated with pulp and related bioprocessing, based on de novo transcriptomes. By comparing these protist-derived enzymes with sequences from public databases using Sequence Similarity Network (SSN) analysis [27,28], we assessed whether protist lineages contain enzyme variants that occupy underrepresented or divergent sequence space and therefore may constitute a promising reservoir for future enzyme mining.

2. Materials and Methods

2.1. Protist Isolation, Culturing and Total RNA Extraction

Four strains of green algae (Archaeplastida), one strain of Cryptophyceae (Cryptista), one strain of Haptophyceae (Haptista), two strains of Chrysophyceae (TSAR), five strains of diatoms (TSAR), one strain of Dinophyceae (TSAR), and one strain of Eustigmatophyceae (TSAR) were isolated from environmental samples using the single-cell pipetting method established in our previous study [29]. While sampling site coordinates and complete raw transcriptome data cannot be publicly released due to industrial confidentiality agreements, all processed enzyme sequence datasets required to support the conclusions of this study are deposited in the Dryad online repository, as noted in the Data Availability section. Additional metadata and materials may be provided upon reasonable request to the corresponding author, subject to confidentiality restrictions. The isolated strains were initially cultured in 24-well plates using the media described in Table 1. Cryptista, Haptista, and TSAR strains were grown under 100 μmol photons s⁻¹ m⁻², whereas Archaeplastida strains were grown under 200 μmol photons s⁻¹ m⁻².
For scale-up, cultures were transferred to 50-mL flasks containing 1 mL of seed culture at approximately 10⁶ cells mL⁻¹ derived from plate cultures, and incubated for two weeks under the same light conditions. Cells were harvested by centrifugation at 2500 × g for 10 min. Pelleted cells were transferred to 2-mL tubes, and 1 mL of TRIzol reagent (Thermo Fisher Scientific, Invitrogen, MA, USA) was added. Total RNA extraction was performed following the manufacturer’s standard protocol. Purified RNA samples were immediately stored at −80 °C and shipped to a commercial high-throughput sequencing provider (Rhelixa Co., Ltd., Tokyo, Japan) within one week.

2.2. Trimming, De Novo Transcriptome Assembly, and Functional Annotation

Raw sequencing reads were first processed using Atria (version 3.2.1) to detect and trim adapter sequences under default parameters. Read quality was assessed with FastQC (version 0.12.0) to confirm the removal of adapter contamination and to evaluate sequence length distributions. De novo transcriptome assembly was performed using Trinity (version 2.15.0) with default settings. The completeness of the assembled transcriptomes was evaluated using BUSCO (version 5.4.5) in transcriptome mode with the eukaryote_odb10 dataset, ensuring comprehensive coverage of expected single-copy orthologous genes. Coding regions were predicted using TransDecoder (version 5.7.1), and the resulting open reading frames (ORFs) were functionally annotated through the Trinotate workflow (version 4.0.2). Translated protein sequences were queried against the UniProtKB/Swiss-Prot database using BLASTp (version 2.10.0; e-value cutoff: 1 × 10⁻⁵) implemented within Trinotate. Conserved protein domains were detected using HMMER (version 3.3.2; e-value cutoff: 1 × 10⁻¹⁰) against the Pfam database. Gene Ontology terms were additionally assigned based on matched functional information using default settings. All annotation results were consolidated within the Trinotate framework and exported as a .csv file for downstream analyses.
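The e-value filtering step described above can be illustrated with a minimal Python sketch. It assumes hits are available in the standard 12-column BLAST tabular format (`-outfmt 6`); the function name `filter_hits`, the choice to keep only the best-scoring hit per query, and the example identifiers are hypothetical and do not reproduce Trinotate's internal logic.

```python
# Column order of BLAST tabular output (-outfmt 6); evalue is the 11th field.
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(lines, max_evalue=1e-5):
    """Keep, per query, the highest-bitscore hit whose e-value passes the cutoff.

    Returns {query_id: (subject_id, bitscore, evalue)}.
    Keeping one hit per query is an illustrative choice, not the authors' rule.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_FIELDS, line.split("\t")))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue  # hit is weaker than the cutoff
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], bitscore, evalue)
    return best


# Made-up hits: query, subject, and score values are placeholders.
example = [
    "orf1\tref_A\t45.0\t200\t110\t2\t1\t200\t5\t204\t1e-30\t120.0",
    "orf1\tref_B\t30.0\t150\t105\t3\t1\t150\t9\t158\t1e-06\t55.0",
    "orf2\tref_C\t25.0\t80\t60\t1\t1\t80\t3\t82\t0.002\t35.0",
]
hits = filter_hits(example)  # orf2 is dropped because 0.002 > 1e-5
```

Passing `max_evalue=1e-10` applies the stricter cutoff that the text reports for the Pfam domain search, although HMMER output uses a different file format than the one parsed here.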

2.3. Protist Species Identification

Initial species identification was performed based on microscopic observations during strain isolation. For accurate taxonomic assignment, 18S rRNA gene cDNA sequences corresponding to the morphologically inferred species were retrieved from GenBank and used as queries in BLASTn (version 2.17.0) searches against each assembled transcriptome. The top-matching sequences were then extracted and subjected to a second BLASTn (version 2.17.0) search using the NCBI online BLAST tool. Final species identification was determined based on the highest-scoring matches from this analysis.

2.4. Retrieval of Target Enzyme Sequences

In this study, we focused on nine classes of enzymes with established industrial applications in papermaking: cellulases, xylanases, pectinases, mannanases, catalases, lipases, laccases, esterases, and amylases. The corresponding Enzyme Commission (EC) numbers for each enzyme class were obtained from the ExPASy Enzyme Nomenclature Database. Protein sequences associated with these EC numbers were subsequently retrieved from the UniProtKB/Swiss-Prot database. To capture phylogenetic diversity, sequences were downloaded separately for each major eukaryotic supergroup using the built-in taxonomy filtering tool. All retrieved sequences were then processed to unify gene identifiers and merged with the enzyme sequences identified in this study (as described below) for further comparative analyses.

2.5. Identification of Enzymes from Protist Transcriptomes

Enzyme names and their recognized synonyms were compiled from the ExPASy database and are summarized in Table 2. These terms were used as search queries to screen the annotation results obtained above, using a custom Python (version 3.12) script developed in-house. Transcripts containing enzyme hits were collected, and their corresponding protein sequences were retrieved from the TransDecoder output described in Section 2.2. Gene identifiers were reformatted to explicitly include the supergroup and species names for clarity. The curated sequences from this study were then merged with the reference sequences retrieved from the UniProtKB/Swiss-Prot database for downstream comparative analyses.
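The in-house script itself is not published, so the following is only a minimal sketch of how such a keyword screen could look. The column names (`transcript_id`, `annotation`), the synonym lists, and the identifier layout produced by `relabel` are all assumptions made for illustration; the actual search terms are those listed in Table 2.

```python
import csv
import io

# Hypothetical synonym lists; the study's actual terms are given in Table 2.
ENZYME_TERMS = {
    "cellulase": ["cellulase", "endoglucanase"],
    "xylanase": ["xylanase"],
}


def screen_annotations(rows, terms, text_column="annotation"):
    """Return {enzyme_class: [transcript ids]} for rows mentioning any synonym."""
    matches = {name: [] for name in terms}
    for row in rows:
        text = row[text_column].lower()
        for name, synonyms in terms.items():
            # Case-insensitive substring match against every synonym.
            if any(s.lower() in text for s in synonyms):
                matches[name].append(row["transcript_id"])
    return matches


def relabel(transcript_id, supergroup, species):
    """Prefix an identifier with supergroup and species (layout is an assumption)."""
    return f"{supergroup}|{species.replace(' ', '_')}|{transcript_id}"


# Made-up annotation table standing in for the exported .csv file.
table = io.StringIO(
    "transcript_id,annotation\n"
    "TRINITY_DN1_c0_g1_i1,Endoglucanase 1 (EC 3.2.1.4)\n"
    "TRINITY_DN2_c0_g1_i1,Actin\n"
    'TRINITY_DN3_c0_g1_i1,"Endo-1,4-beta-xylanase A"\n'
)
found = screen_annotations(csv.DictReader(table), ENZYME_TERMS)
label = relabel(found["cellulase"][0], "Haptista", "Prymnesium parvum")
```

Substring matching is simple but can over-match (for example, a term embedded in an inhibitor name), so hits from a screen like this would still need manual review.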

2.6. Sequence Similarity Network Analysis

For each enzyme class, protein sequences were combined into a single FASTA file and submitted to the EFI-EST webserver to construct Sequence Similarity Networks (SSNs). BLASTp searches were performed using the default e-value threshold of 1 × 10⁻⁵. The initial Alignment Score Threshold, which defines the minimum similarity required for an edge to be drawn between two nodes, was set to 35. This relatively low threshold was selected based on EFI-EST guidelines to prevent oversplitting of isofunctional families, thereby maintaining biologically meaningful clustering for downstream analysis. The resulting SSN files were imported into Cytoscape (version 3.10.4) for visualization and network refinement. The Prefuse Force-Directed layout was applied, and parameters such as spring length and node mass were manually adjusted to improve network separation and clarity. Node shapes were used to distinguish sequences from UniProtKB/Swiss-Prot versus those obtained in this study, while node colors represented the respective eukaryotic supergroups. Nodes with fewer connecting edges, which represent more unique or distantly related sequences, were repositioned when necessary to enhance interpretability. Distinct networks were organized and ranked based on their sizes.
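To make the role of the Alignment Score Threshold concrete, the sketch below builds networks from precomputed pairwise scores and bins them using the size categories from Table 3. It is a simplified illustration, not the EFI-EST implementation: the pairwise scores are taken as given inputs, and the function names and example values are hypothetical.

```python
from collections import defaultdict


def connected_components(nodes, scored_pairs, threshold=35):
    """Group nodes into networks, drawing an edge when score >= threshold.

    scored_pairs is an iterable of (node_a, node_b, alignment_score).
    Returns (list of node sets, number of edges drawn).
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = 0
    for a, b, score in scored_pairs:
        if score >= threshold:
            edges += 1
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values()), edges


def size_category(n_nodes):
    """Size bins as defined in Table 3."""
    if n_nodes == 1:
        return "single"
    if n_nodes <= 5:
        return "small"
    if n_nodes <= 30:
        return "medium-sized"
    return "major"


# Made-up scores for four sequences.
nodes = ["a", "b", "c", "d"]
pairs = [("a", "b", 80), ("b", "c", 36), ("c", "d", 20)]
networks_35, edges_35 = connected_components(nodes, pairs, threshold=35)
networks_50, edges_50 = connected_components(nodes, pairs, threshold=50)
```

With these example scores, a threshold of 35 yields one three-member network plus a singleton, while a threshold of 50 splits off a second singleton, which is the fragmentation effect that a stricter cutoff is expected to produce.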

3. Results

3.1. Protist Species Identification

As summarized in Table 1, four green algal strains belonging to the Archaeplastida lineage were successfully isolated. These included one strain identified as Haematococcus lacustris, two strains as Chlamydomonas parallestriata, and one strain as Desmodesmus pannonicus. One Cryptista strain (Cryptophyceae) was determined to be Hemiselmis andersenii, and a single Haptista strain (Haptophyceae) was identified as Prymnesium parvum. Among the nine TSAR strains, two members of Chrysophyceae were isolated and both assigned to Ochromonas vasocystis. Five diatom strains were recovered: three were identified as Nitzschia sp., one as Sellaphora pupula-like, and one as Amphora subtropica-like based on sequence similarity. Additionally, one Dinophyceae strain (Amphidinium carterae) and one Eustigmatophyceae strain (Nannochloropsis oceanica) were identified. An overview of the in silico workflow is shown in Figure 1.

3.2. Sequence Similarity Network Analysis

3.2.1. Network Size and Node Connectivity as General Indicators of Enzyme Novelty

The SSN summary (Table 3) provides complementary indicators for prioritizing enzyme families and subnetworks for novel enzyme mining, but key patterns should be confirmed using the full SSNs in the Supplementary Figures. First, network counts by size provide a simple heuristic: enzyme categories with more small networks and singletons are more likely to include sequences with limited similarity to public databases and thus higher discovery potential. However, this depends on the composition of each small network, because clusters dominated by public sequences do not necessarily reflect protist-driven novelty. Second, the percentage of total nodes per network size makes the same point more explicitly by showing how sequences are distributed across major, medium-sized, small, and singleton networks. A greater fraction of nodes in small networks and singletons suggests more fragmented, low-similarity sequence space. For example, lipases show substantial node fractions outside major networks (medium-sized 28.8%, small 22.0%, singletons 11.6%), whereas catalases are concentrated in major networks (91.2%). Third, the percentage of protist-derived nodes among all nodes provides an essential baseline for each enzyme dataset. This baseline is required to interpret network-size-specific enrichment, because protist weighting differs strongly among enzymes (e.g., lipase 93.3%, xylanase 84.8% vs. pectinase 38.6%, laccase 31.9%, cellulase 61.2%).
Accordingly, the percentage of protist-derived nodes per network size, when compared to the baseline, indicates whether protist sequences preferentially occupy smaller networks. Enrichment of protist-derived nodes in small networks and/or singletons relative to baseline supports higher novelty potential, as seen for cellulase (61.2% overall vs. 80.2% in small networks), mannanase (53.2% vs. 100% in singletons), catalase (56.0% vs. 72.2% in small and 75% in singletons), laccase (31.9% vs. 50% in small), and amylase (56.6% vs. 81.5% in medium-sized). Conversely, reduced protist representation in major networks (e.g., esterase: 70.8% overall vs. 30.6% in major networks) suggests protist sequences are shifted toward smaller networks, which is favorable for mining.
Finally, the edges-to-nodes ratio (number of neighbors) serves as a proxy for within-network similarity, where lower values suggest greater divergence and higher novelty potential, but should be interpreted alongside detailed SSNs. Overall, combining baseline protist contribution with size-specific protist enrichment and connectivity provides a practical framework for prioritizing candidates for downstream validation.
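The comparison against baseline can be expressed as a simple ratio, sketched below using three percentage pairs quoted in the text. The ratio itself is an illustrative summary statistic introduced here, not a metric reported by the authors, and the names `BASELINE`, `PER_SIZE`, and `enrichment` are hypothetical.

```python
# Percent of protist-derived nodes among all nodes (values quoted from Table 3).
BASELINE = {"cellulase": 61.2, "mannanase": 53.2, "esterase": 70.8}

# Percent of protist-derived nodes within one network-size category.
PER_SIZE = {
    ("cellulase", "small"): 80.2,
    ("mannanase", "single"): 100.0,
    ("esterase", "major"): 30.6,
}


def enrichment(enzyme, size):
    """Size-specific protist share divided by the enzyme's overall protist share.

    Values above 1 mean protist sequences are over-represented in that size
    category relative to baseline; values below 1 mean under-represented.
    """
    return PER_SIZE[(enzyme, size)] / BASELINE[enzyme]


for (enzyme, size) in PER_SIZE:
    print(f"{enzyme:10s} {size:8s} {enrichment(enzyme, size):.2f}")
```

Under this reading, cellulase small networks and mannanase singletons come out above 1, and esterase major networks come out well below 1, matching the direction of the patterns described above.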
Table 3. Summary of Sequence Similarity Network analysis, in which networks are categorized by size and node degree (number of neighbors) is used as an indicator of relative sequence novelty for each enzyme type.
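| Enzyme Name | Total Nodes | Total Edges | Network Size | Number of Networks | Percentage of Total Nodes per Network Size | Percentage of Protist-Derived Nodes Among All Nodes | Percentage of Protist-Derived Nodes per Network Size | Number of Neighbors (Edges/Nodes Ratio) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cellulase | 1241 | 30,233 | Major (>30 nodes) | 5 | 66.6% | 61.2% | 54.2% | 40.986 |
| | | | Medium-sized (6–30 nodes) | 21 | 19.8% | | 75.2% | 16.08 |
| | | | Small (≤5 nodes) | 33 | 8.5% | | 80.2% | 2.8 |
| | | | Single | 64 | 5.2% | | 64.1% | 0 |
| Xylanase | 316 | 3024 | Major | 1 | 45.60% | 84.80% | 83.30% | 36.75 |
| | | | Medium-sized | 7 | 25.30% | | 91.30% | 10.556 |
| | | | Small | 14 | 13.00% | | 70.70% | 4 |
| | | | Single | 51 | 16.10% | | 90.20% | 0 |
| Pectinase | 145 | 2168 | Major | 1 | 55.90% | 38.60% | 0.00% | 59 |
| | | | Medium-sized | 3 | 29.70% | | 97.70% | 9.368 |
| | | | Small | 5 | 10.30% | | 66.70% | 3.6 |
| | | | Single | 6 | 4.10% | | 66.70% | 0 |
| Mannanase | 188 | 2300 | Major | 2 | 71.80% | 53.20% | 47.40% | 37.114 |
| | | | Medium-sized | 2 | 9.60% | | 66.70% | 7.5 |
| | | | Small | 8 | 13.30% | | 56.00% | 2 |
| | | | Single | 10 | 5.30% | | 100.00% | 0 |
| Catalase | 341 | 16,439 | Major | 2 | 91.20% | 56.00% | 54.30% | 144.732 |
| | | | Medium-sized | 0 | 0.00% | | / | / |
| | | | Small | 5 | 5.30% | | 72.20% | 4 |
| | | | Single | 12 | 3.50% | | 75% | 0 |
| Lipase | 3070 | 19,058 | Major | 18 | 37.50% | 93.30% | 89.30% | 55.788 |
| | | | Medium-sized | 95 | 28.80% | | 95.00% | 10.952 |
| | | | Small | 244 | 22.00% | | 97.00% | 4 |
| | | | Single | 357 | 11.60% | | 94.70% | 0 |
| Laccase | 138 | 2719 | Major | 1 | 85.50% | 31.90% | 31.40% | 45.983 |
| | | | Medium-sized | 0 | 0% | | / | / |
| | | | Small | 2 | 4.30% | | 50.00% | 2 |
| | | | Single | 14 | 10.10% | | 28.60% | 0 |
| Esterase | 439 | 4374 | Major | 2 | 33.50% | 70.80% | 30.60% | 69.095 |
| | | | Medium-sized | 14 | 33.00% | | 89.70% | 13.478 |
| | | | Small | 34 | 21.20% | | 90.30% | 4 |
| | | | Single | 54 | 12.30% | | 96.30% | 0 |
| Amylase | 316 | 3955 | Major | 3 | 64.20% | 56.60% | 46.30% | 37.685 |
| | | | Medium-sized | 3 | 8.50% | | 81.50% | 10 |
| | | | Small | 12 | 10.40% | | 69.70% | 2.8 |
| | | | Single | 53 | 16.80% | | 75.50% | 0 |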

3.2.2. Detailed SSNs Reveal Potentially Novel Enzymes

Figure 2 provides an overview of where potential novelty of protist-derived enzymes may occur across different SSN size categories. Protist-derived sequences present as singletons are, by definition, not connected to UniProtKB/Swiss-Prot sequences under the applied threshold and therefore represent high novelty potential. In major networks, all enzyme types show high connectivity among nodes, indicating that protist-derived sequences within these networks largely overlap with UniProtKB/Swiss-Prot sequence space and thus have limited sequence-level novelty. Within medium-sized networks, cellulases, mannanases, esterases, and TSAR-derived amylases include protist-derived nodes that are separated from the main UniProtKB/Swiss-Prot clusters, suggesting increased novelty potential. Within small networks, separated protist-derived nodes are observed for cellulases, TSAR-derived xylanases, Archaeplastida-derived pectinases, TSAR-derived mannanases, catalases, lipases, laccases, esterases, and most amylases (except Cryptista-derived sequences), collectively indicating that these low-membership networks may be enriched for protist-origin sequences with higher novelty potential.
For cellulases (Figure 2, top panel; Supplementary Figure S1), the full SSN revealed five major networks (each containing >30 nodes), 26 medium-sized networks (6–30 nodes), 28 small networks (≤5 nodes), and 64 singletons. The largest network was predominantly composed of sequences from Viridiplantae and Haptista, with smaller contributions from TSAR and several other eukaryotic lineages. The second-largest network mainly contained Fungal cellulases, alongside TSAR and a minor proportion of other supergroups. The third-largest cluster again consisted primarily of Viridiplantae cellulases, followed by nearly equal proportions of TSAR and Archaeplastida sequences. The fourth-largest network was enriched in Archaeplastida cellulases, with additional TSAR sequences and limited representation from Fungi, Haptista, and Cryptista. The fifth major network was dominated by Fungal cellulases, supplemented by Haptista, TSAR, and Metazoa sequences. Among the medium-sized networks, three clusters were composed mostly of Fungal cellulases, and three others consisted primarily of Bacterial cellulases, while the remaining networks were formed by cellulases from one or a few protist supergroups. Small networks similarly tended to include sequences from a single supergroup or a combination of only two or fewer groups, except for three clusters consisting exclusively of Bacterial cellulases. Regarding singletons, 23 of the 64 sequences originated from UniProtKB/Swiss-Prot, whereas the remaining 41 represented enzyme candidates identified exclusively from the protist transcriptomes in this study.
For xylanases (Figure 2, second panel; Supplementary Figure S2), the full SSN revealed one major network, seven medium-sized networks, fourteen small networks, and fifty-one singletons. The major cluster consisted predominantly of sequences from Haptista, TSAR, and Fungi, with a minor contribution from Archaeplastida. Among the seven medium-sized networks, two clusters contained mixtures of Bacterial sequences, whereas the remaining networks were composed of xylanases from one or, at most, two protist supergroups. Within the fourteen small networks, five exhibited a mixture of Bacterial sequences, while the others consisted of a single protist supergroup. Regarding singletons, five out of fifty-one were derived from UniProtKB/Swiss-Prot, representing Fungal, Bacterial, or Viridiplantae origins. The remaining forty-six singleton nodes corresponded exclusively to xylanase candidates identified from the protist transcriptomes analyzed in this study.
For pectinases (Figure 2, third panel; Supplementary Figure S3), the full SSN identified one major network, four medium-sized networks, five small networks, and six singletons. The major network consisted almost entirely of Fungal pectinases, with a single sequence from Metazoa. Among the four medium-sized networks, one cluster contained only Viridiplantae sequences, another was composed exclusively of TSAR sequences, one included members from three different protist supergroups, and the remaining network consisted mainly of Haptista and TSAR sequences with one additional Viridiplantae sequence. For the small networks, one cluster comprised solely Bacterial pectinases, one consisted of four TSAR sequences together with one Viridiplantae sequence, and the remaining three clusters contained sequences from a single protist supergroup. Of the six singletons, one originated from Bacteria, one from Metazoa, and the remaining four represented pectinase candidates exclusively identified from protists in this study.
For mannanases (Figure 2 fourth panel; Supplementary Figure S4), the full SSN identified two major networks, two medium-sized networks, eight small networks, and ten singletons. The first major network consisted of TSAR, Archaeplastida, and Fungal mannanases, along with smaller contributions from Metazoa and Haptista. The second major network was dominated by Viridiplantae and Fungal sequences, followed by additional representatives from Haptista and Archaeplastida. Among the two medium-sized networks, one was composed exclusively of TSAR sequences, while the other primarily consisted of Fungal sequences with a single Bacterial representative. Within the small networks, three clusters contained only Bacterial mannanases, one consisted solely of Haptista sequences, two comprised TSAR sequences only, and one included a mixture of Haptista and Metazoa sequences. All ten singleton nodes represented mannanase candidates derived exclusively from protists analyzed in the present study.
For catalases (Figure 2 fifth panel; Supplementary Figure S5), the full SSN identified two major networks, five small networks, and twelve singletons. The first major network consisted primarily of Bacterial, Viridiplantae, and Metazoan catalases, with additional but limited representation from Archaeplastida, TSAR, and Archaea. The second major network comprised exclusively protist sequences, dominated by Haptista and TSAR, with smaller contributions from Archaeplastida and Cryptista. Among the five small networks, two clusters contained only Bacterial catalases, while the remaining three were composed of sequences from one or, at most, two protist supergroups. Regarding singletons, two sequences originated from Bacteria and one from Viridiplantae, whereas the remaining nine singleton nodes represented catalase candidates exclusively identified from the protists analyzed in this study.
For lipases (Figure 2 sixth panel; Supplementary Figure S6), the full SSN revealed eighteen major networks, ninety-five medium-sized networks, 244 small networks, and 357 singletons. The largest major network consisted exclusively of protist lipases, dominated by Haptista and TSAR, with smaller contributions from Archaeplastida. The second-largest cluster contained mostly TSAR sequences with additional Haptista and Archaeplastida members, along with a few Viridiplantae and Fungal representatives. The third major network also consisted solely of protist lipases, predominantly TSAR, followed by smaller fractions of Archaeplastida and Cryptista. The fourth major network showed a similar trend, being mainly composed of TSAR members and limited representatives of Haptista and Archaeplastida. The fifth cluster contained nearly equal contributions from multiple protist supergroups but included a few Fungal and Metazoan lipases. The sixth major network consisted exclusively of protist sequences, largely from TSAR, with additional Haptista, Archaeplastida, and Cryptista sequences. The seventh network also included only protist lipases, with comparable numbers from Archaeplastida and TSAR. The eighth cluster was mainly composed of Archaeplastida and Metazoan members, with limited contributions from TSAR and Haptista. The ninth through twelfth major networks consisted exclusively of protist lipases representing multiple supergroups. The thirteenth network was mainly composed of Haptista and TSAR sequences, along with a minor number of Metazoan sequences. The fourteenth through seventeenth networks included only protist sequences, while the eighteenth major network consisted exclusively of Fungal lipases. Among the medium-sized networks, twenty-four clusters were composed solely of TSAR members, fourteen were exclusively Archaeplastida, and one consisted almost entirely of Archaeplastida except for a single TSAR sequence. 
Additionally, two networks comprised only Metazoan lipases, one consisted exclusively of Fungal sequences, one contained only Bacterial sequences, and one included only Haptista sequences. Of the remaining medium-sized networks, only one contained UniProtKB/Swiss-Prot (Metazoan) sequences, while the others contained lipases from one or multiple protist supergroups. For the small networks, except for one composed solely of Metazoan sequences, five that contained only Bacterial sequences, and two clusters consisting exclusively of Fungal sequences, the remaining networks were formed by lipases belonging to one or, at most, two protist supergroups. Regarding singletons, aside from six Bacterial, eleven Fungal, and one Viridiplantae sequence, all remaining singleton nodes represented protist lipases detected exclusively in this study.
For laccases (Figure 2 seventh panel; Supplementary Figure S7), the full SSN identified one major network, two small networks, and fourteen singletons. The major network was predominantly composed of Viridiplantae and Fungal sequences, with smaller contributions from Archaeplastida and TSAR. Of the two small networks, one consisted solely of Archaeplastida sequences, whereas the other was a mixed cluster containing both Bacterial and Archaeal laccases. Among the fourteen singletons, four originated from TSAR, while the remaining ten represented Fungal laccases.
For esterases (Figure 2 eighth panel; Supplementary Figure S8), the full SSN identified two major networks, fourteen medium-sized networks, thirty-four small networks, and fifty-four singletons. The first major network was primarily composed of Viridiplantae sequences with a few additional representatives from Fungi. The second major network included a mixture of Archaeplastida, TSAR, Haptista, and Metazoa sequences. Among the fourteen medium-sized networks, two clusters were dominated by Metazoan proteins, while the remaining networks consisted of esterases from one or, at most, three protist supergroups. Regarding the small networks, aside from two clusters containing Bacterial esterases and one cluster composed solely of Fungal sequences, all others consisted of esterases from one or two protist supergroups. Among the fifty-four singletons, two originated from Viridiplantae, while the remaining fifty-two represented esterase candidates detected exclusively from protists analyzed in this study.
For amylases (Figure 2 bottom panel; Supplementary Figure S9), the full SSN identified three major networks, three medium-sized networks, twelve small networks, and fifty-three singletons. The first major network was largely composed of Haptista, Fungal, Archaeplastida, and TSAR sequences, with a few additional Metazoan members. The second major network mainly consisted of Metazoan sequences, accompanied by smaller numbers of TSAR and Bacterial amylases. The third major network was dominated by Archaeplastida and Viridiplantae sequences, together with a limited number of TSAR representatives. Among the three medium-sized networks, two clusters consisted exclusively of TSAR sequences, whereas the remaining cluster contained both Bacterial and Archaeplastida sequences. Within the twelve small networks, one cluster included Bacterial and Cryptista sequences, another contained Bacterial and Archaeal sequences, and one consisted solely of Viridiplantae members. All other small clusters were composed of amylases originating from protist supergroups analyzed in this study. For the fifty-three singletons, five were assigned to Bacteria, six to Viridiplantae, one to Metazoa, and one to Archaea, while the remaining forty singleton nodes corresponded to amylase candidates uniquely detected in the protist transcriptomes generated in this study.

4. Discussion

4.1. Species-Level Distribution of Targeted Enzyme Candidates

Although not directly related to the main objectives of this study, it is noteworthy that the average number of enzyme candidates per species differs considerably among protist lineages. For example, Haptophyceae showed the highest per-species average for cellulases (263/1), xylanases (84/1), catalases (61/1), esterases (49/1), and lipases (438/1), suggesting a broad enzymatic capacity that may reflect their metabolic versatility in diverse marine environments. Dinophyceae also exhibited high per-species averages of cellulases (73/1) and lipases (319/1), consistent with their complex trophic strategies such as mixotrophy [30]. By comparison, Eustigmatophyceae displayed high diversity of cellulase (79/1) and relatively high diversity of xylanase (15/1) activities, yet no pectinases or mannanases were detected, indicating a potential specialization in limited polysaccharide substrates. In contrast, Chlorophyceae consistently showed moderate enzyme diversity but a high average for lipases (715/4 ≈ 179 per species) and amylases (76/4 = 19 per species), aligning with their known reliance on storage lipids and carbohydrate metabolism. These lineage-specific enzyme profiles may reflect ecological and physiological adaptations—an interesting avenue for future investigation. Furthermore, this comparative trend provides preliminary insight into which protist supergroups may yield greater enzymatic diversity and should be prioritized in future bioprospecting efforts.

4.2. Evidence for Sequence-Level Novelty in Protist-Derived Enzymes

Sequence Similarity Network (SSN) analysis essentially conducts comprehensive pairwise BLASTp comparisons among all sequences within a dataset. This approach enables the visualization of distinct clusters based on sequence similarity relationships. The EFI-EST platform provides an API (Application Programming Interface) that allows users to specify an Alignment Score Threshold, which determines whether an edge will be formed between two nodes in the network. Adjusting this threshold can either keep isofunctional sequences clustered together or separate them into smaller components or even singletons (https://efi.igb.illinois.edu/efi-est/tutorial_analysis.php, accessed on 19 November 2025). Since the most appropriate threshold is context-dependent, no universal cutoff value exists; rather, the choice should be tailored to the biological question being addressed. It should also be noted that SSNs have limitations: network structure can be biased by the selected similarity threshold (often derived from BLAST-based metrics), and BLAST e-values in particular depend on database size, making direct comparisons of e-value cutoffs across SSNs generated from different database sizes potentially misleading. In addition, different graph layout algorithms can produce visually distinct network geometries, which may influence interpretation. In this study, our goal was to determine whether the protist-derived enzyme candidates identified here resemble known sequences in public databases. We selected the recommended threshold of 35, which effectively differentiated sequences that are highly similar (i.e., presumably not evolutionarily novel) to known enzymes from those forming isolated networks composed exclusively of protist-derived sequences. The latter subset may represent promising targets for future discovery of novel enzymatic functions.
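The effect of the threshold on network structure can be illustrated with a toy example. The Python sketch below is not EFI-EST code: the sequence names and pairwise alignment scores are invented, and the helper `components_at_threshold` is hypothetical. It keeps an edge only when the score meets the threshold and then reports the connected components, which correspond to the networks and singletons discussed here.

```python
from collections import defaultdict

def components_at_threshold(nodes, scored_pairs, threshold):
    """Connected components of the graph whose edges have score >= threshold."""
    adjacency = defaultdict(set)
    for (a, b), score in scored_pairs.items():
        if score >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(adjacency[node] - members)
        seen |= members
        components.append(members)
    return components

# Invented pairwise alignment scores for five hypothetical sequences.
nodes = ["sp_A", "sp_B", "sp_C", "protist_1", "protist_2"]
scores = {
    ("sp_A", "sp_B"): 80,
    ("sp_B", "sp_C"): 40,
    ("sp_C", "protist_1"): 20,
    ("protist_1", "protist_2"): 60,
}

for threshold in (15, 35, 70):
    comps = components_at_threshold(nodes, scores, threshold)
    print(threshold, sorted(sorted(c) for c in comps))
```

In this toy case, a permissive threshold of 15 joins all sequences into one network, a threshold of 35 separates the two hypothetical protist sequences from the database sequences, and a threshold of 70 fragments most of the network into singletons.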
Across all nine enzyme categories examined, the SSN analyses consistently revealed clusters composed uniquely of protist-origin sequences, suggesting the presence of previously uncharacterized enzymes, although these candidates have not yet been evaluated by further in silico analyses (e.g., structure prediction) or by experimental validation. We emphasize, however, that “novelty” here is based solely on sequence-level similarity, which does not necessarily correlate with biochemical function. Alternative approaches, such as HMM-based motif detection, tertiary-structure modeling, and enzyme superfamily classification, can provide complementary predictions, but none of these methods can definitively determine function [31]. Ultimately, experimental validation, such as heterologous expression or advanced cell-free assay systems, is required to confirm enzymatic activity [32].
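Selecting clusters composed exclusively of protist-origin sequences amounts to a simple filter over network components. The following Python sketch assumes that each node carries a source label, analogous to the DataSource annotation column described for File S3; the label strings, node names, and the function `protist_only_clusters` are hypothetical.

```python
def protist_only_clusters(components, source_of, study_label="present_study"):
    """Return components whose members all originate from the present study."""
    return [
        members
        for members in components
        if all(source_of[node] == study_label for node in members)
    ]

# Hypothetical components and source labels, for illustration only.
components = [
    {"sp_A", "sp_B", "protist_3"},  # mixed: database and study sequences
    {"protist_1", "protist_2"},     # study sequences only
    {"protist_4"},                  # singleton from the study
    {"sp_C"},                       # singleton from the database
]
source_of = {
    "sp_A": "swissprot", "sp_B": "swissprot", "sp_C": "swissprot",
    "protist_1": "present_study", "protist_2": "present_study",
    "protist_3": "present_study", "protist_4": "present_study",
}

novel = protist_only_clusters(components, source_of)
print([sorted(c) for c in novel])
```

Such a filter identifies sequence-level candidates only and carries no information about biochemical function.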
Importantly, protists represent an enormous and insufficiently explored reservoir of eukaryotic genetic diversity. Compared with computational design approaches, which face the substantial challenge of almost limitless sequence-function possibilities, mining naturally evolved enzymes from protists offers a more feasible strategy for discovering novel biocatalysts. Given their extensive yet under-studied representation across the eukaryotic tree of life, protists constitute a largely untapped “treasure chest” for innovative enzyme discovery.

4.3. Benefits of Axenic, Single-Strain Transcriptome Resources

This study builds upon a single-cell isolation technology that traces back to techniques developed by one of the authors nearly four decades ago. The method was first applied to the single-cell “pickup” of dinophycean cysts [33]. This technique was later passed down within our institute (Seed Bank Co., Ltd., Kyoto, Japan) and adapted nearly a decade ago [33] for isolating resting spores of diatoms, structures analogous to the seeds of terrestrial plants. By culturing these resting spores, we were able to obtain vegetative cells, which are the forms most commonly sampled and observed in environmental studies. This approach enabled us to establish direct relationships between resting spores and vegetative forms in specific diatom species. In the present study, we expanded this method to isolate individual cells of diverse protist taxa across multiple eukaryotic supergroups. Establishing strains from single cells provides several major advantages compared with mining enzymes solely from metagenomic or metatranscriptomic datasets. First, single-cell-derived cultures effectively eliminate contamination and minimize the need for extensive computational filtering to remove sequences from non-target organisms. Rare exceptions exist: in our previous work on foraminifera, the calcified test structure harbored associated eukaryotes and bacteria, making contamination unavoidable and requiring substantial in silico effort to identify genuine foraminiferal sequences [34]. Second, many eukaryotic enzymes are difficult to express functionally in heterologous systems, even with carefully designed constructs [29]. In some cases, enzymatic activity is detectable only when the complete biosynthetic gene cluster (BGC) is expressed, rather than when genes are expressed individually; this constraint appears to be particularly relevant for protist-derived pathways [35]. Maintaining axenic cultures of the producing organisms provides an alternative means to obtain enzymes of interest directly through purification from the native host. Third, knowing the biological origin of a promising enzyme can strategically guide further discovery. Origin information offers biological context, such as environmental conditions and physiological traits, that can accelerate the search for enhanced variants. For instance, if an experimentally validated enzyme produced through heterologous expression exhibits high salinity tolerance, related species inhabiting tidal flats, where salinity fluctuates substantially during low tide, could be targeted to identify potentially more robust variants of that enzyme. Thus, single-cell protist isolation not only ensures data accuracy but also creates a renewable biological source of understudied enzymes, connecting sequence discovery with ecological insight and practical biotechnological potential.

4.4. Challenges and Prospects

Despite the advantages of single-cell protist isolation, several challenges must be addressed to fully unlock the potential of protist-derived enzymes. First, the cost of strain maintenance remains a major bottleneck. Cryopreservation techniques for protists are still poorly developed, and many species cannot survive freezing and thawing without significant mortality. As a result, continuous subculturing is often the only viable approach for long-term preservation. This practice demands substantial time, personnel effort, and facility space. Discarding strains after genome or transcriptome sequencing may reduce maintenance costs, but doing so forfeits the benefits of having axenic cultures for downstream functional studies and biotechnological applications. Second, large-scale cultivation presents practical limitations. If a promising enzyme cannot be efficiently produced through heterologous expression, direct cultivation of the native protist host may be necessary. Although our institute (Seed Bank Co., Ltd.) has strong expertise in maintaining a broad range of protist strains under axenic conditions at laboratory volumes (<10 L), scaling up production would require specialized infrastructure, such as large clean-room environments, which would impose substantial operational costs. Some protists, such as Chlorella, Spirulina, and Euglena, are robust and tolerant of extreme environmental conditions that suppress competing microorganisms, making them relatively easy to cultivate. In contrast, many protists, particularly those with high nutritional value or specialized ecological roles, are highly vulnerable to predation and contamination. This vulnerability may become a critical bottleneck for enzyme biomanufacturing using native protist hosts. Lastly, protist model systems and genetic tools are still in their infancy. Metabolic pathways in most protists remain poorly characterized [36,37]. While this represents an exciting opportunity to discover novel bioactive compounds and enzymes with unique properties, it also means that advancing protists into reliable expression platforms will require additional time and technological development.

5. Conclusions

Protists represent a major yet underexplored reservoir of enzymatic diversity. Using single-cell isolation, we established axenic strains across four eukaryotic supergroups and generated de novo transcriptomes. By focusing on nine enzyme classes with industrial relevance in papermaking and applying Sequence Similarity Network analysis, we identified evolutionarily distinct clusters composed solely of protist-derived sequences, indicating novel enzyme candidates. Cultivating strains from single cells eliminates contamination issues and enables future biochemical characterization, which is particularly important for protist enzymes that may be difficult to express in heterologous hosts. However, limitations in cryopreservation and the high cost of large-scale cultivation in industrial settings remain major obstacles to practical enzyme production. Overall, this work demonstrates that protists are a valuable source of previously uncharacterized enzymes and highlights the importance of advancing cultivation and functional screening technologies to unlock their full industrial potential.

Supplementary Materials

The following supporting information can be downloaded: https://www.mdpi.com/article/10.3390/fermentation12020071/s1. Figures S1–S9: Sequence Similarity Network (SSN) constructed from UniProtKB/Swiss-Prot and present study amino acid sequences for cellulases, xylanases, pectinases, mannanases, catalases, lipases, laccases, esterases, and amylases, corresponding to the EC numbers listed in Table 2. Hexagons represent sequences obtained from the UniProtKB/Swiss-Prot database, whereas circles represent sequences originating from this study. Networks were categorized into four size classes: major networks (≥30 members), medium-sized networks (6–30 members), small networks (2–5 members), and singletons (one sequence). Node positions were initially arranged using Cytoscape’s automated layout and subsequently adjusted manually to improve visual clarity. Nodes with fewer edges (i.e., lower similarity connectivity) are placed toward the periphery, whereas nodes with higher connectivity cluster near the center. Hexagonal nodes are color-coded into seven taxonomic groups—Archaea, Amoebozoa, Bacteria, Fungi, Metazoa, Viridiplantae, and TSAR—following the origin categories used by the UniProtKB/Swiss-Prot database, with only groups present for each enzyme displayed in the legend. Circular nodes (present study sequences) are likewise color-coded into seven protist classes: Bacillariophyceae, Chlorophyceae, Chrysophyceae, Cryptophyceae, Dinophyceae, Eustigmatophyceae, and Haptophyceae. For xylanases, the SSN analysis revealed one major network, seven medium-sized networks, fourteen small networks, and fifty-one singletons. The major cluster consisted predominantly of sequences from Haptista, TSAR, and Fungi, with a minor contribution from Archaeplastida. File S1: Amino acid sequences of all enzymes analyzed in this study, including both the sequences downloaded from UniProtKB/Swiss-Prot and those extracted from the de novo transcriptome assemblies generated in the present work. 
File S2: EFI-EST network files in .xgmml format. Sequence IDs correspond exactly to those listed in File S1. File S3: Cytoscape project files containing the complete SSN visualizations used in the manuscript. These files can be opened directly with Cytoscape. Additional annotation columns are included: DataSource (distinguishing UniProtKB/Swiss-Prot sequences from transcriptome-derived sequences in this study), Group (categorizing enzymes by organismal origin), and Simplified description (used as node labels for clarity).

Author Contributions

Conceptualization, M.W.L.T.; methodology, M.W.L.T. and M.K.; software, M.W.L.T.; validation, M.W.L.T. and K.M.; formal analysis, M.W.L.T. and M.K.; investigation, M.W.L.T.; resources, K.M.; data curation, M.W.L.T.; writing—original draft preparation, M.W.L.T.; writing—review and editing, M.W.L.T.; visualization, M.W.L.T.; supervision, K.M.; project administration, K.M.; funding acquisition, K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All sequencing data (excluding the whole-transcriptome assembly files, as explained in Section 2) and all analysis outputs generated in this study, including EFI-EST network files and Cytoscape project files, are provided in the Supplementary Materials.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (version 5.1 Plus) to assist with English proofreading, content refinement, and generation of Python scripts for in silico analysis. All AI-generated content was reviewed, edited, and verified by the authors, who take full responsibility for the final version of this publication.

Conflicts of Interest

Authors Manabu W. L. Tanimura, Motoki Kayama, and Kazumi Matsuoka are employed by Seed Bank Co., Ltd.

Abbreviations

The following abbreviations are used in this manuscript:
SSN: Sequence Similarity Network
EC: Enzyme Commission

References

  1. Himmel, M.E.; Ding, S.Y.; Johnson, D.K.; Adney, W.S.; Nimlos, M.R.; Brady, J.W.; Foust, T.D. Biomass Recalcitrance: Engineering Plants and Enzymes for Biofuels Production. Science 2007, 315, 804–807.
  2. Bukhari, I.; Haq, F.; Kiran, M.; Aziz, T.; Mehmood, S.; Haroon, M. Lignocellulosic biomass as a renewable resource: Driving second-generation biofuel innovation from agricultural waste. Biomass Bioenergy 2025, 201, 108133.
  3. Zhou, Z.; Liu, D.; Zhao, X. Conversion of lignocellulose to biofuels and chemicals via sugar platform: An updated review on chemistry and mechanisms of acid hydrolysis of lignocellulose. Renew. Sustain. Energy Rev. 2021, 146, 111169.
  4. Toda, M.; Takagaki, A.; Okamura, M.; Kondo, J.; Hayashi, S.; Domen, K.; Hara, M. Biodiesel made with sugar catalyst. Nature 2005, 438, 178.
  5. Xu, N.; Liu, S.; Xin, F.; Zhou, J.; Jia, H.; Xu, J.; Jiang, M.; Dong, W. Biomethane Production from Lignocellulose: Biomass Recalcitrance and Its Impacts on Anaerobic Digestion. Front. Bioeng. Biotechnol. 2019, 7, 191.
  6. Liu, Y.; Min, J.; Feng, X.; He, Y.; Liu, J.; Wang, Y.; He, J.; Do, H.; Sage, V.; Yang, G.; et al. A Review of Biohydrogen Productions from Lignocellulosic Precursor via Dark Fermentation: Perspective on Hydrolysate Composition and Electron-Equivalent Balance. Energies 2020, 13, 2451.
  7. Lynd, L.R.; Weimer, P.J.; van Zyl, W.H.; Pretorius, I.S. Microbial Cellulose Utilization: Fundamentals and Biotechnology. Microbiol. Mol. Biol. Rev. 2002, 66, 506–577.
  8. Setegn, H.; Abate, A. Pectinase from Microorganisms and Its Industrial Applications. Sci. World J. 2022, 2022, 1881305.
  9. Sun, Y.; Cheng, J. Hydrolysis of lignocellulosic materials for ethanol production: A review. Bioresour. Technol. 2002, 83, 1–11.
  10. Kenealy, W.R.; Jeffries, T.W. Enzyme Processes for Pulp and Paper: A Review of Recent Developments. In Wood Deterioration and Preservation; ACS Symposium Series 845; American Chemical Society: Washington, DC, USA, 2003; Chapter 12; pp. 210–239.
  11. Gutierrez, A.; Del Rio, J.C.; Martinez, A.T. Microbial and enzymatic control of pitch in the pulp and paper industry. Appl. Microbiol. Biotechnol. 2009, 82, 1005–1018.
  12. Coker, J.A. Extremophiles and biotechnology: Current uses and prospects. F1000 Res. 2016, 5, 396.
  13. Siddiqui, K.S.; Cavicchioli, R. Cold-adapted enzymes. Annu. Rev. Biochem. 2006, 75, 403–433.
  14. Antranikian, G.; Egorova, K. Extremophiles, a Unique Resource of Biocatalysts for Industrial Biotechnology. In Physiology and Biochemistry of Extremophiles; Gerday, C., Glansdorff, N., Eds.; ASM Press: Washington, DC, USA, 2007; Chapter 27.
  15. Ishino, S.; Ishino, Y. DNA polymerases as useful reagents for biotechnology—The history of developmental research in the field. Front. Microbiol. 2014, 5, 465.
  16. Chalfie, M.; Tu, Y.; Euskirchen, G.; Ward, W.W.; Prasher, D.C. Green Fluorescent Protein as a Marker for Gene Expression. Science 1994, 263, 802–805.
  17. Rigoldi, F.; Donini, S.; Redaelli, A.; Parisini, E.; Gautieri, A. Engineering of thermostable enzymes for industrial applications. APL Bioeng. 2018, 2, 011501.
  18. Contesini, F.J.; Davanço, M.G.; Borin, G.P.; Vanegas, K.G.; Cirino, J.P.G.; Melo, R.R.; Mortensen, U.H.; Hildén, K.; Campos, D.R.; Carvalho, P.O. Advances in Recombinant Lipases: Production, Engineering, Immobilization and Application in the Pharmaceutical Industry. Catalysts 2020, 10, 1032.
  19. Ndochinwa, O.G.; Wang, Q.Y.; Amadi, O.C.; Nwagu, T.N.; Nnamchi, C.I.; Okeke, E.S.; Moneke, A.N. Current status and emerging frontiers in enzyme engineering: An industrial perspective. Heliyon 2024, 10, e32673.
  20. Tanimura, A.; Liu, W.; Yamada, K.; Kishida, T.; Toyohara, H. Animal cellulases with a focus on aquatic invertebrates. Fish. Sci. 2013, 79, 1–13.
  21. Liu, W.; Nafisyah, A.L.; Koike, K.; Matsuoka, K. Reappraisal of cellulase activities in mangrove wetlands resulting from preliminary investigations in East Java, Indonesia. Bull. Osaka Mus. Nat. Hist. 2021, 75, 15–27.
  22. Caron, D.; Worden, A.; Countway, P.; Demir, E.; Heidelberg, K.B. Protists are microbes too: A perspective. ISME J. 2009, 3, 4–12.
  23. The UniProt Consortium. UniProt: The Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023, 51, D523–D531.
  24. Burki, F.; Roger, A.J.; Brown, M.W.; Simpson, A.G.B. The new tree of eukaryotes. Trends Ecol. Evol. 2020, 35, 43–55.
  25. Zallot, R.; Oberg, N.; Gerlt, J.A. The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019, 58, 4169–4182.
  26. Atkinson, H.J.; Morris, J.H.; Ferrin, T.E.; Babbitt, P.C. Using Sequence Similarity Networks for Visualization of Relationships Across Diverse Protein Superfamilies. PLoS ONE 2009, 4, e4345.
  27. Pawlowski, J.; Audic, S.; Adl, S.; Bass, D.; Belbahri, L.; Berney, C.; Bowser, S.S.; Cepicka, I.; Decelle, J.; Dunthorn, M.; et al. CBOL Protist Working Group: Barcoding Eukaryotic Richness beyond the Animal, Plant, and Fungal Kingdoms. PLoS Biol. 2012, 10, e1001419.
  28. Gao, X.; Chen, K.; Xiong, J.; Zou, D.; Yang, F.; Ma, Y.; Jiang, C.; Gao, X.; Wang, G.; Gu, S.; et al. The P10K database: A data portal for the protist 10 000 genomes project. Nucleic Acids Res. 2024, 52, D747–D755.
  29. Ishii, K.I.; Imai, I.; Natsuike, M.; Sawayama, S.; Ishino, R.; Liu, W.; Fukusaki, K.; Ishikawa, A. A simple technique for establishing axenic cultures of centric diatoms from resting stage cells in bottom sediments. Phycologia 2018, 57, 674–679.
  30. Jeong, H.J.; du Yoo, Y.D.; Kim, J.S.; Seong, K.A.; Kang, N.S.; Kim, T.H. Growth, feeding and ecological roles of the mixotrophic and heterotrophic dinoflagellates in marine planktonic food webs. Ocean Sci. J. 2010, 45, 65–91.
  31. Robinson, S.L.; Piel, J.; Sunagawa, S. A roadmap for metagenomic enzyme discovery. Nat. Prod. Rep. 2021, 38, 1994–2023.
  32. Khanal, A.; McLoughlin, S.Y.; Kershner, J.P.; Copley, S.D. Differential Effects of a Mutation on the Normal and Promiscuous Activities of Orthologs: Implications for Natural and Directed Evolution. Mol. Biol. Evol. 2015, 32, 100–108.
  33. Matsuoka, K. Organic-walled dinoflagellate cysts from surface sediments of Nagasaki Bay and Senzaki Bay, West Japan. Bull. Fac. Lib. Arts Nagasaki Univ. 1985, 25, 21–115.
  34. Tanimura, M.W.L.; Nagai, Y.; Matsuoka, K.; Toyofuku, T. Endogenous glycoside hydrolases reveal foraminiferal capacity to degrade terrestrial and marine polysaccharides. ISME Commun. 2025, 5, ycaf149.
  35. Keeling, P.J.; del Campo, J. Marine protists are not just big bacteria. Curr. Biol. 2017, 27, R541–R549.
  36. Faktorová, D.; Nisbet, R.E.R.; Fernández Robledo, J.A.; Casacuberta, E.; Sudek, L.; Allen, A.E.; Ares, M.; Aresté, C.; Balestreri, C.; Barbrook, A.C.; et al. Genetic tool development in marine protists: Emerging model organisms for experimental cell biology. Nat. Methods 2020, 17, 481–494.
  37. Schoenle, A.; Francis, O.; Archibald, J.M.; Burki, F.; de Vries, J.; Dumack, K.; Eme, L.; Florent, I.; Hehenberger, E.; Hoffmeyer, T.T.; et al. Protist genomics: Key to understanding eukaryotic evolution. Trends Genet. 2025, 41, 868–882.
Figure 1. In silico workflow from protist cell culture to Sequence Similarity Network (SSN) analysis. Squares represent major analytical steps, while bold labels outside each square denote the software or tools used. Arrows indicate the direction of data flow, with accompanying text specifying the type of data transferred between steps.
Figure 2. Simplified Sequence Similarity Networks (SSNs) constructed from UniProtKB/Swiss-Prot sequences and amino acid sequences obtained in this study for cellulases, xylanases, pectinases, mannanases, catalases, lipases, laccases, esterases, and amylases, corresponding to the EC numbers listed in Table 2. Nodes represent sequences (grouped by phylogenetic origin), and edges represent sequence similarity. Hexagonal nodes (UniProtKB/Swiss-Prot origin) are grouped into seven taxonomic categories following UniProtKB/Swiss-Prot origin categories: Archaea (red), Amoebozoa (pink), Bacteria (light green), Fungi (gray), Metazoa (blue), Viridiplantae (yellow), and TSAR (orange). Circular nodes (present study origin) are grouped into major eukaryotic lineages represented in this study: TSAR (orange, Bacillariophyceae, Chrysophyceae, Dinophyceae, and Eustigmatophyceae), Archaeplastida (green, Chlorophyceae), Cryptista (light purple, Cryptophyceae), and Haptista (purple, Haptophyceae). Nodes are color-coded for clarity, and the color scheme was standardized across all enzyme types. Networks were classified into four size categories: major networks (≥30 nodes), medium-sized networks (6–30 nodes), small networks (2–5 nodes), and singletons (single sequence). Hexagons represent UniProtKB/Swiss-Prot sequences, whereas circles represent sequences from this study. In the catalase SSN, a single octagon is shown within the major network to indicate a network containing both UniProtKB/Swiss-Prot and protist-derived sequences. Node positions were initially generated using Cytoscape’s automated layout and were subsequently adjusted manually to improve visual clarity; therefore, node sizes, positions, and edge lengths have no statistical meaning. Full networks, in which each enzyme sequence is represented as an individual node, are provided in Supplementary Figures S1–S9.
Table 1. Axenic protist cultures used in the present study.
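| No. | Eukaryotic Supergroup | Class | Species Name | Culture Medium |
|---|---|---|---|---|
| 1 | Archaeplastida | Chlorophyceae | Chlamydomonas parallestriata strain 1 | AF6 |
| 2 | Archaeplastida | Chlorophyceae | Chlamydomonas parallestriata strain 2 | AF6 |
| 3 | Archaeplastida | Chlorophyceae | Desmodesmus pannonicus | AF6 |
| 4 | Archaeplastida | Chlorophyceae | Haematococcus lacustris | AF6 |
| 5 | Cryptista | Cryptophyceae | Hemiselmis andersenii | IMK |
| 6 | Haptista | Haptophyceae | Prymnesium parvum | IMKSi |
| 7 | TSAR | Chrysophyceae | Ochromonas vasocystis strain 1 | AF6 |
| 8 | TSAR | Chrysophyceae | Ochromonas vasocystis strain 2 | AF6 |
| 9 | TSAR | Bacillariophyceae | Nitzschia sp. 1 | IMKSi |
| 10 | TSAR | Bacillariophyceae | Nitzschia sp. 2 | IMKSi |
| 11 | TSAR | Bacillariophyceae | Nitzschia sp. 3 | IMKSi |
| 12 | TSAR | Bacillariophyceae | Sellaphora aff. pupula | IMKSi |
| 13 | TSAR | Bacillariophyceae | Amphora aff. subtropica | IMKSi |
| 14 | TSAR | Dinophyceae | Amphidinium carterae | IMKSi |
| 15 | TSAR | Eustigmatophyceae | Nannochloropsis oceanica | IMK |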
Table 2. Industrially relevant enzyme categories analyzed in this study, including sequences from public databases and newly identified protist-derived enzymes.
Table 2. Industrially relevant enzyme categories analyzed in this study, including sequences from public databases and newly identified protist-derived enzymes.
Enzyme NameEC NumberAlternative Enzyme NameIndustrial ApplicationsNumbers of Protein Sequences Found in UniProtKB/Swiss-ProtNumbers of Protein Found in Present Study/Strain Included
cellulaseEC 3.2.1.4avicelase
beta-1,4-endoglucanhydrolase
beta-1,4-glucanase
carboxymethylcellulase
celludextrinase
endo-1,4-beta-D-glucanase
endo-1,4-beta-D-glucanohydrolase
endoglucanase
Cellulose degradation; essential enzyme in the paper industry, used to improve the brightness and strength of paper.Bacteria: 92
Amoebozoa: 1
Fungi: 54
Metazoa: 3
Viridiplantae: 54
Total: 204
Archaeplastida
Chlorophyceae: 122/4

Cryptista
Cryptophyceae: 5/1

Haptista
Haptophyceae: 263/1

TSAR
Chrysophyceae: 58/2
Bacillariophyceae: +G2: G15159/5
Dinophyceae: 73/1
Eustigmatophyceae: 79/1
EC 3.2.1.21beta-glucosidase
beta-D-glucosideglucohydrolase
cellobiase
Bacteria: 23
Amoebozoa: 1
Fungi: 107
Metazoa: 3
Viridiplantae: 86
Total: 220
EC 3.2.1.74glucan1,4-beta-glucosidase
1,4-beta-D-glucanglucohydrolase
exo-1,4-beta-D-glucosidase
exo-1,4-beta-glucanase
exo-1,4-beta-glucosidase
N.A.
EC 3.2.1.91cellulose1,4-beta-cellobiosidase
1,4-beta-cellobiohydrolase
4-beta-D-glucancellobiohydrolase
avicelase
exo-1,4-beta-D-glucanase
exocellobiohydrolase
exoglucanase
Bacteria: 9
Fungi: 49
Metazoa: 1
Total: 59
EC 3.2.1.176cellulose1,4-beta-cellobiosidase
cellulaseSS
endoglucanaseSS
Bacteria: 2
EC 3.2.1.203carboxymethylcellulase
CMCase
N.A.
XylanaseEC 3.2.1.8endo-1,4-beta-xylanaseHemicellulose degradation; promotes the release of lignin from pulp during bleaching.N.A.Archaeplastida
Chlorophyceae: 44/4

Cryptista
Cryptophyceae: 0

Haptista
Haptophyceae: 84/1

TSAR
Chrysophyceae: 20/2
Bacillariophyceae: 82/5
Dinophyceae: 23/1
Eustigmatophyceae: 15/1
EC 3.2.1.37xylan 1,4-beta-xylosidase
1,4-beta-D-xylan xylohydrolase
beta-xylosidase
exo-1,4-beta-xylosidase
xylobiase
Bacteria: 17
Fungi: 24
Viridiplantae: 4
Total: 45
EC 3.2.1.156oligosaccharide reducing-end xylanase
reducing end xylose-releasing exo-oligoxylanase
Bacteria: 3
Total: 3
PectinaseEC 3.2.1.15endo-polygalacturonase
pectin depolymerase
polygalacturonase
Enhancement of pulp bleaching; Degumming; Treats pectic substances in wastewater, contributing to reductions in COD and BOD.Bacteria: 5
Fungi: 59
Metazoa: 2
Viridiplantae: 23
Total: 89
Archaeplastida
Chlorophyceae: 3/2

Cryptista
Cryptophyceae: 0

Haptista
Haptophyceae: 10/1

TSAR
Chrysophyceae: 15/2
Bacillariophyceae: 17/5
Dinophyceae: 11/1
Eustigmatophyceae: 0
MannanaseEC 3.2.1.25beta-mannosidase
mannase
Enhancement of pulp bleaching
Mannan degradation improves physical properties such as pulp drainability and beatability (fiber fibrillation).
Fungi: 21
Metazoa: 6
Total: 27
Archaeplastida
Chlorophyceae: 44/4

Cryptista
Cryptophyceae: 0

Haptista
Haptophyceae: 36/1

TSAR
Chrysophyceae: 11/2
Bacillariophyceae: 56/5
Dinophyceae: 4/1
Eustigmatophyceae: 0
EC 3.2.1.78mannan endo-1,4-beta-mannosidase
beta-mannanase
endo-1,4-mannanase
Bacteria: 10
Fungi: 27
Metazoa: 2
Viridiplantae: 22
Total: 61
CatalaseEC 1.11.1.6/Neutralizing residual hydrogen peroxide (H2O2), a common oxidizing agent used in bleaching and etching processesBacteria: 66
Amoebozoa: 2
Fungi: 21
Metazoa: 15
TSAR: 1
Viridiplantae: 44
Total: 150
Archaeplastida
Chlorophyceae: 40/4

Cryptista
Cryptophyceae: 4/1

Haptista
Haptophyceae: 61/1

TSAR
Chrysophyceae: 21/2
Bacillariophyceae: 105/5
Dinophyceae: 10/1
Eustigmatophyceae: 13/1
LipaseEC 3.1.1.3triacylglycerol lipase
tributyrase
triglyceride lipase
Acts on triglycerides in pitch
Reduces sticky deposits.
Bacteria: 44
Fungi: 92
Metazoa: 65
Viridiplantae: 5
Total: 206
Archaeplastida
Chlorophyceae: 715/4

Cryptista
Cryptophyceae: 48/1

Haptista
Haptophyceae: 438/1

TSAR
Chrysophyceae: 339/2
Bacillariophyceae: 835/5
Dinophyceae: 319/1
Eustigmatophyceae: 172/1
LaccaseEC 1.10.3.2Urushiol oxidaseUsed in bleaching, especially for recycled paper.Archaea: 1
Bacteria: 2
Fungi: 46
Viridiplantae: 45
Total: 84
Archaeplastida
Chlorophyceae: 26/4

Cryptista
Cryptophyceae: 0

Haptista
Haptophyceae: 0

TSAR
Chrysophyceae: 5/2
Bacillariophyceae: 11/3
Dinophyceae: 1/1
Eustigmatophyceae: 1/1
EsteraseEC 3.1.1.11Pectinesterase
pectin demethoxylase
pectin methoxylase
pectin methylesterase
Reducing pitchBacteria: 6
Fungi: 9
Metazoa: 1
Viridiplantae: 88
Total: 104
Archaeplastida
Chlorophyceae: 68/4

Cryptista
Cryptophyceae: 3/1

Haptista
Haptophyceae: 49/1

TSAR
Chrysophyceae: 19/2
Bacillariophyceae: 88/5
Dinophyceae: 59/1
Eustigmatophyceae: 28/1
EC 3.1.1.13
sterol esterase
Cholesterol esterase
cholesterol ester synthase
triterpenol esterase
Amoebozoa: 1
Fungi: 3
Metazoa: 20
Total: 24
Amylase
EC 3.2.1.1
alpha-amylase
glycogenase
Starch with reduced viscosity via amylase treatment is used as a paper surface coating agent or internal filler
Archaea: 5
Bacteria: 36
Fungi: 15
Metazoa: 54
Viridiplantae: 27
Total: 137
Archaeplastida
Chlorophyceae: 76/4

Cryptista
Cryptophyceae: 1/1

Haptista
Haptophyceae: 20/1

TSAR
Chrysophyceae: 5/2
Bacillariophyceae: 7/3
Dinophyceae: 38/1
Eustigmatophyceae: 10/1
