5.1. Pre-Training Dataset Collection
We first collected a dataset comprising 21,498 microorganisms and their corresponding optimal growth temperatures (OGTs) from the literature [41]. We queried the UniRef100 [42] database (release 2023_05) using the TaxID field in the UniRef FASTA description to retrieve all protein sequences associated with each organism (the mapping script is available at
https://github.com/ginnm/ThermoFormer/blob/main/build_ogt_dataset.py (accessed on 19 March 2026)). Each protein was then annotated with the OGT of its host organism. When multiple OGT values were reported for the same taxon ID, we adopted the mean value. In our dataset, 1080 out of 18,534 organisms (∼5.8%) had conflicting OGT annotations from multiple sources; the mean absolute difference among conflicting entries was 0.98 °C. Because the UniRef100 database clusters sequences at 100% sequence identity, each unique protein maps to exactly one representative taxon ID, so OGT conflicts arise only at the organism level. We applied the following filtering pipeline: sequences containing non-standard amino acid residues (B, J, O, U, X, Z) were removed (490,316 sequences); sequences longer than 2048 residues or shorter than 32 residues were also excluded (410,699 and 39,412 sequences, respectively). Exact-duplicate sequences are not present in UniRef100 by construction. After filtering, the final OGT dataset contains 96,017,137 sequences from 14,612 organisms.
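The annotation and filtering steps described above can be sketched as follows. This is a minimal illustration rather than the released `build_ogt_dataset.py` script: the function names and the toy records are hypothetical, and only the thresholds (32 to 2048 residues, exclusion of B, J, O, U, X, Z, and averaging of conflicting OGT values) are taken from the text.

```python
from collections import defaultdict
from statistics import mean

NON_STANDARD = set("BJOUXZ")   # residue codes excluded in the text
MIN_LEN, MAX_LEN = 32, 2048    # length bounds stated in the text


def mean_ogt_per_taxon(records):
    """Average all OGT values reported for the same taxon ID."""
    grouped = defaultdict(list)
    for taxid, ogt in records:
        grouped[taxid].append(ogt)
    return {taxid: mean(values) for taxid, values in grouped.items()}


def keep_sequence(seq):
    """Return True if the sequence passes the length and residue filters."""
    if not (MIN_LEN <= len(seq) <= MAX_LEN):
        return False
    return not (set(seq) & NON_STANDARD)


def annotate(proteins, ogt_by_taxon):
    """Yield (sequence, OGT) pairs for filtered proteins with a known host OGT."""
    for taxid, seq in proteins:
        if taxid in ogt_by_taxon and keep_sequence(seq):
            yield seq, ogt_by_taxon[taxid]


# Toy data (hypothetical): taxon 1 has two conflicting OGT reports.
ogt = mean_ogt_per_taxon([(1, 37.0), (1, 39.0), (2, 80.0)])
proteins = [(1, "M" * 40), (2, "MX" * 20), (2, "MK"), (3, "A" * 50)]
dataset = list(annotate(proteins, ogt))
```

In the toy example only the first protein survives: the second contains X, the third is shorter than 32 residues, and the fourth belongs to an organism without an OGT label.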
We split the pre-training dataset into a training set, a validation set, a mixed-species test set, and a cross-species test set. The split is performed at the species level using NCBI Taxonomy IDs. The validation set and the cross-species test set each contain 100 organism species, identified by unique NCBI Taxonomy IDs, that are entirely absent from the training set. We verified that no organism appears under multiple taxon IDs by cross-referencing the NCBI Taxonomy database. The complete lists of taxon IDs for the 100 validation-set and 100 cross-species test organisms are published in our GitHub repository (
https://github.com/ginnm/ThermoFormer (accessed on 19 March 2026)) for full reproducibility. The mixed-species test set contains 500,000 sequences randomly drawn from the training organisms. A data leakage analysis using MMseqs2, including sequence identity statistics between downstream test sets (TM-Cell, TM-Atlas, OCT) and the pre-training corpus, is provided in
Supplementary Note S1. Genus-level generalization analysis and a discussion of OGT label noise are provided in Supplementary Notes S4 and S5, respectively.
The statistical information of our dataset and splits is shown in Table 11.
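A species-level split of this kind can be sketched as below. The function name, the seed, and the toy data are hypothetical, and the sketch assumes that the mixed-species test sequences are removed from the training set after sampling, which the text does not state explicitly.

```python
import random


def split_by_species(seqs_by_taxon, n_val, n_cross, n_mixed, seed=0):
    """Hold out whole taxa for validation and cross-species testing."""
    rng = random.Random(seed)
    taxa = sorted(seqs_by_taxon)
    rng.shuffle(taxa)
    val_taxa = set(taxa[:n_val])
    cross_taxa = set(taxa[n_val:n_val + n_cross])
    train_taxa = set(taxa[n_val + n_cross:])

    def collect(taxon_set):
        return [(t, s) for t in sorted(taxon_set) for s in seqs_by_taxon[t]]

    pool = collect(train_taxa)
    # Assumption: sequences drawn for the mixed-species test set are
    # excluded from training; only their organisms remain shared.
    mixed_idx = set(rng.sample(range(len(pool)), n_mixed))
    return {
        "train": [x for i, x in enumerate(pool) if i not in mixed_idx],
        "mixed_test": [x for i, x in enumerate(pool) if i in mixed_idx],
        "val": collect(val_taxa),
        "cross_test": collect(cross_taxa),
    }


# Toy data (hypothetical): 10 taxa with 5 sequences each.
toy = {t: [f"SEQ{t}_{k}" for k in range(5)] for t in range(10)}
splits = split_by_species(toy, n_val=2, n_cross=2, n_mixed=6)
```

Because taxa are assigned to exactly one of the three groups before any sequence is sampled, no validation or cross-species organism can leak into training.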
5.2. Model Architecture and Pre-Training
ThermoFormer is a pre-trained Transformer model. It contains four components: a transformer-based encoder for extracting residue-level representations, an attention-based pooling layer for aggregating the residue-level representation into a sequence-level representation, a sequence decoder for MLM pre-training, and a predictor for OGT prediction.
Design Rationale. While Transformer-based architectures and masked language modeling are established techniques in protein language modeling [21,23,28], ThermoFormer’s key novelty lies in the integration of large-scale supervised OGT pre-training with unsupervised MLM at an unprecedented scale of 96 million sequences. We adopt an MLM-based encoder rather than causal language modeling (CLM) because MLM produces bidirectional representations that capture the global context of protein sequences, which is essential for predicting properties determined by the entire sequence [20,30]. The combination of supervised and unsupervised objectives is designed to incorporate temperature-aware information into the learned representations while preserving the MLM objective’s sequence-understanding capability. We use an attention-based pooling mechanism rather than simpler alternatives (e.g., mean pooling or [CLS] token) because temperature-related information may be unevenly distributed across residue positions, and the attention mechanism can learn to selectively weight the most informative positions. These design choices are validated by our ablation studies in Section 4.4.
These components are detailed below:
Transformer-based encoder. The Transformer-based encoder [19] encodes the protein sequence into a sequence of contextual hidden states. Let $X = (x_1, x_2, \ldots, x_L)$ denote a protein sequence, where $x_i \in \{0, 1\}^{V}$ is the one-hot encoding of the $i$-th residue, $L$ is the length of the protein and $V$ is the residue vocabulary size. The encoder first maps each residue into a dense vector through a learnable token embedding matrix:
$$h_i^{(0)} = W_E\, x_i$$
where $W_E \in \mathbb{R}^{d \times V}$ is the token embedding matrix and $d$ is the hidden dimension. Instead of additive positional embeddings, we adopt Rotary Position Embedding (RoPE) [43], which encodes positional information by rotating the query and key vectors in the self-attention mechanism. Specifically, for position $i$, the rotation matrix $R_i$ is defined as follows:
$$R_i = \begin{pmatrix} \cos i\theta_1 & -\sin i\theta_1 & \cdots & 0 & 0 \\ \sin i\theta_1 & \cos i\theta_1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \cos i\theta_{d_k/2} & -\sin i\theta_{d_k/2} \\ 0 & 0 & \cdots & \sin i\theta_{d_k/2} & \cos i\theta_{d_k/2} \end{pmatrix}$$
where $\theta_j = 10000^{-2(j-1)/d_k}$ for $j = 1, \ldots, d_k/2$, with $d_k$ the dimension of each query and key vector. The rotation is applied to the query and key vectors before computing the attention scores, enabling the attention to be a function of relative positions between residues rather than absolute positions. This property is particularly beneficial for protein sequences, as the functional relevance of residue interactions often depends on their relative spacing in the sequence rather than their absolute positions.
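The relative-position property of RoPE can be checked numerically with a small sketch. The code below rotates consecutive coordinate pairs of a query and a key and compares attention scores at different absolute positions. The helper names and the toy vectors are hypothetical, and the base of 10000 is the standard RoPE choice, assumed here rather than confirmed for ThermoFormer.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """Angles position * theta_j for the dim/2 coordinate pairs (j is zero-based)."""
    return [position * base ** (-2.0 * j / dim) for j in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive coordinate pair of `vec` by its position-dependent angle."""
    out = []
    for j, angle in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * j], vec[2 * j + 1]
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (hypothetical values, per-head dimension 4).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(apply_rope(q, 3), apply_rope(k, 7))      # positions (3, 7), offset 4
score_b = dot(apply_rope(q, 103), apply_rope(k, 107))  # positions (103, 107), offset 4
score_c = dot(apply_rope(q, 3), apply_rope(k, 8))      # positions (3, 8), offset 5
```

Here `score_a` and `score_b` agree up to floating-point error because both pairs are four positions apart, while `score_c` differs because the offset changes.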
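The embedded sequence $H^{(0)} = (h_1^{(0)}, \ldots, h_L^{(0)})$ is then processed through $N$ stacked Transformer layers. Each Transformer layer $l$ consists of a multi-head self-attention (MHSA) sub-layer followed by a position-wise feed-forward network (FFN) sub-layer, each equipped with residual connections and layer normalization [44]:
$$U^{(l)} = H^{(l-1)} + \mathrm{MHSA}\big(\mathrm{LN}(H^{(l-1)})\big), \qquad H^{(l)} = U^{(l)} + \mathrm{FFN}\big(\mathrm{LN}(U^{(l)})\big)$$
where $l = 1, \ldots, N$ and $\mathrm{LN}(\cdot)$ denotes the layer normalization function.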
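In the multi-head self-attention sub-layer, the input is projected into $K$ parallel attention heads. For the $k$-th head, the query, key, and value vectors of the $i$-th residue are computed from its input representation $h_i$ as follows:
$$\mathbf{q}_i^{(k)} = W_Q^{(k)} h_i, \qquad \mathbf{k}_i^{(k)} = W_K^{(k)} h_i, \qquad \mathbf{v}_i^{(k)} = W_V^{(k)} h_i$$
where $W_Q^{(k)}, W_K^{(k)}, W_V^{(k)} \in \mathbb{R}^{d_k \times d}$ are the projection matrices and $d_k = d/K$ is the dimension per head. The RoPE rotation is then applied to the query and key vectors before computing the attention scores:
$$\tilde{\mathbf{q}}_i^{(k)} = R_i\, \mathbf{q}_i^{(k)}, \qquad \tilde{\mathbf{k}}_j^{(k)} = R_j\, \mathbf{k}_j^{(k)}$$
This ensures that the dot product $(\tilde{\mathbf{q}}_i^{(k)})^{\top} \tilde{\mathbf{k}}_j^{(k)}$ depends only on the relative position $j - i$, since $R_i^{\top} R_j = R_{j-i}$. The scaled dot-product attention is then computed as follows:
$$\mathrm{head}^{(k)} = \mathrm{softmax}\!\left(\frac{\tilde{\mathbf{Q}}^{(k)} (\tilde{\mathbf{K}}^{(k)})^{\top}}{\sqrt{d_k}}\right) \mathbf{V}^{(k)}$$
where $\tilde{\mathbf{Q}}^{(k)}$ and $\tilde{\mathbf{K}}^{(k)}$ are matrices formed by stacking the rotated query and key vectors, respectively, and $\mathbf{V}^{(k)}$ stacks the value vectors. The outputs from all $K$ heads are concatenated and linearly projected:
$$\mathrm{MHSA}(H) = \mathrm{Concat}\big(\mathrm{head}^{(1)}, \ldots, \mathrm{head}^{(K)}\big)\, W_O$$
where $W_O \in \mathbb{R}^{d \times d}$ is the output projection matrix.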
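The position-wise feed-forward network consists of two linear transformations with a GELU [45] activation function:
$$\mathrm{FFN}(h) = W_2\, \sigma(W_1 h + b_1) + b_2$$
where $W_1 \in \mathbb{R}^{d_{ff} \times d}$, $W_2 \in \mathbb{R}^{d \times d_{ff}}$ are weight matrices, $b_1 \in \mathbb{R}^{d_{ff}}$, $b_2 \in \mathbb{R}^{d}$ are bias terms, $\sigma(\cdot)$ is the GELU activation function, and $d_{ff}$ is the intermediate dimension.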
After $N$ Transformer layers, the final contextual representations are:
$$H^{(N)} = \big(h_1^{(N)}, h_2^{(N)}, \ldots, h_L^{(N)}\big)$$
where $h_i^{(N)} \in \mathbb{R}^{d}$ is the contextual embedding of the
i-th residue, capturing both local amino acid identity and global sequence context. In our implementation, the encoder comprises
layers with
attention heads and a hidden dimension of
, resulting in approximately 650 million parameters. We adopt the flash attention mechanism [40] to improve computational efficiency during both pre-training and inference.
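Sequence Decoder. The sequence decoder learns to recover the masked tokens from the hidden states. It contains two position-wise dense layers with a GELU activation and a layer normalization layer [44]:
$$p_i = \mathrm{softmax}\Big(W_o\, \mathrm{LN}\big(\sigma(W_h h_i^{(N)})\big) + b_o\Big)$$
where $W_h \in \mathbb{R}^{d \times d}$, $W_o \in \mathbb{R}^{V \times d}$ and $b_o \in \mathbb{R}^{V}$ are learnable parameters, $\mathrm{LN}(\cdot)$ is the layer normalization function and $\sigma(\cdot)$ is the GELU [45] activation function. $p_i$ is the probability distribution of the predicted $i$-th residue. We use cross-entropy as the loss function:
$$\mathcal{L}_{\mathrm{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log p_i(x_i)$$
where $\mathcal{M}$ is the set of masked positions, $x_i$ represents the true residue for the $i$-th token in the sequence, and $p_i(x_i)$ denotes the predicted probability for the correct residue.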
Attention-based Pooling Layer. The attention-based pooling layer learns to aggregate the hidden states $H^{(N)}$ into a global hidden state for further adaptation on a sequence-level task. The weights of the hidden states $H^{(N)}$ are computed by a projection-softmax layer that produces a weighted vector $s$:
$$\alpha_i = \frac{\exp\big(w_a^{\top} h_i^{(N)} + b_a\big)}{\sum_{j=1}^{L} \exp\big(w_a^{\top} h_j^{(N)} + b_a\big)}, \qquad s = \sum_{i=1}^{L} \alpha_i\, h_i^{(N)}$$
where $\alpha_i$ is the attention weight of the $i$-th residue and $w_a \in \mathbb{R}^{d}$ and $b_a \in \mathbb{R}$ are the learnable parameters of the attention pooling layer. Then, a multi-layer perceptron with two dense layers and GELU activation is employed to transform the weighted vector $s$. The first dense layer maps $s$ to the same dimension as the feed-forward network (FFN) layer of the Transformer encoder, which in our implementation is four times the size of the hidden layer. The second dense layer maps the output of the first layer back to the original dimension. Between the first and second dense layers, a GELU activation function is applied. Additionally, there is a residual connection between the output of the second dense layer and the output of the attention layer:
$$z = s + W_2^{p}\, \sigma\big(W_1^{p} s\big)$$
where $W_1^{p} \in \mathbb{R}^{d_{ff} \times d}$ and $W_2^{p} \in \mathbb{R}^{d \times d_{ff}}$ are learnable parameters and $\sigma(\cdot)$ is the GELU activation function. The output hidden state $z$ is the representation of the whole sequence.
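The pooling step can be illustrated with a small dependency-free sketch. All names, shapes, and toy values below are hypothetical; the MLP omits bias terms as a simplifying assumption, and the weights would be learned during pre-training rather than set by hand.

```python
import math


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def attention_pool(hidden, w, b, W1, W2):
    """Pool residue-level states (L x d) into one d-dimensional vector."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) + b for h in hidden]
    top = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]          # one attention weight per residue
    d = len(hidden[0])
    pooled = [sum(a * h[j] for a, h in zip(alpha, hidden)) for j in range(d)]
    mlp_out = matvec(W2, [gelu(u) for u in matvec(W1, pooled)])
    z = [p + m for p, m in zip(pooled, mlp_out)]   # residual connection
    return alpha, z


# Toy example (hypothetical): L = 3 residues, d = 2, MLP width 4 * d = 8.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[0.1, -0.2] for _ in range(8)]           # shape 8 x 2
W2 = [[0.05] * 8 for _ in range(2)]            # shape 2 x 8
alpha, z = attention_pool(hidden, w=[2.0, 0.0], b=0.0, W1=W1, W2=W2)
```

With the projection vector `[2.0, 0.0]`, residues whose first coordinate is large receive more weight, which is the mechanism that lets the model emphasize informative positions instead of averaging uniformly.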
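Predictor. The predictor learns to predict a temperature value $\hat{T}$ from the sequence representation $z$. It has two dense layers and a $\tanh$ activation function:
$$\hat{T} = W_2^{r} \tanh\big(W_1^{r} z + b_1^{r}\big) + b_2^{r}$$
where $W_1^{r}, W_2^{r}$ and $b_1^{r}, b_2^{r}$ are learnable parameters, $\tanh(\cdot)$ is the hyperbolic tangent (Tanh) activation function, and $\hat{T}$ is the predicted temperature. We use the mean squared error (MSE) criterion as the loss function:
$$\mathcal{L}_{\mathrm{OGT}} = \mathbb{E}\big[(\hat{T} - T)^2\big]$$
where $\mathbb{E}$ denotes the expectation, $\hat{T}$ is the predicted temperature, and $T$ is the ground truth temperature.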
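Joint Loss Function. The pre-training loss function is the sum of the MLM loss $\mathcal{L}_{\mathrm{MLM}}$ and the OGT regression loss $\mathcal{L}_{\mathrm{OGT}}$. We observed that $\mathcal{L}_{\mathrm{OGT}}$ has a significantly different magnitude compared to $\mathcal{L}_{\mathrm{MLM}}$, with values ranging from 0 to 1000 initially and stabilizing at 0 to 100 later. We therefore multiplied $\mathcal{L}_{\mathrm{OGT}}$ by 0.01 to maintain numerical stability. The final joint loss function is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + 0.01\, \mathcal{L}_{\mathrm{OGT}}$$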
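A minimal sketch of the weighted joint objective is given below. The function names and toy numbers are hypothetical, and averaging the cross-entropy over masked positions only is an assumption about the MLM term; the 0.01 weight on the temperature loss is the value stated in the text.

```python
import math


def mlm_loss(probs, targets, masked_positions):
    """Mean cross-entropy over masked positions (assumed averaging scheme)."""
    total = sum(math.log(probs[i][targets[i]]) for i in masked_positions)
    return -total / len(masked_positions)


def ogt_loss(predicted, actual):
    """Mean squared error over a batch of temperatures (in degrees Celsius)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, actual)) / len(actual)


def joint_loss(l_mlm, l_ogt, ogt_weight=0.01):
    """Cross-entropy plus the down-weighted temperature regression loss."""
    return l_mlm + ogt_weight * l_ogt


# Toy batch (hypothetical): 3 tokens over a 3-letter vocabulary, 2 sequences.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
targets = [0, 1, 2]
l_mlm = mlm_loss(probs, targets, masked_positions=[0, 2])
l_ogt = ogt_loss(predicted=[40.0, 70.0], actual=[37.0, 80.0])
total = joint_loss(l_mlm, l_ogt)
```

In this toy batch the raw MSE (54.5) is roughly two orders of magnitude larger than the cross-entropy (about 0.52), so the 0.01 weight brings the two terms to a comparable scale.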