3.1. 2s-DAS Framework
Pipeline. We propose the 2s-DAS framework for the temporal action segmentation task. 2s-DAS employs two diffusion models with encoder-decoder structures, one receiving the RGB feature sequence and the other the optical flow (FLOW) feature sequence. Given pre-extracted frame-wise video feature sequences $X^{\mathrm{rgb}}$ and $X^{\mathrm{flow}}$, each of size $T \times d$, where $T$ is the length of the video sequence and $d$ is the dimension of the input features, each feature sequence is fed into the encoder of its stream to extract higher-level feature representations. The encoder outputs a predicted label sequence $P^{\mathrm{enc}}$ of size $B \times C \times T$, where $B$ is the batch size, $C$ is the number of action categories, and $T$ remains the length of the video sequence. $P^{\mathrm{enc}}$ is used to compute the encoder loss, which updates the learnable parameters of the encoder. The encoder also generates a higher-level feature representation $F$, which is passed to the decoder as conditional information, together with the noisy data, to guide the decoder in recovering the original action label sequence from noise. In addition, the encoder provides feature maps from different layers as extra inputs to the decoder, strengthening its understanding and modeling of the video content. The decoder applies different importance sampling strategies (Section 3.2) to the input features; its inputs also include the diffusion step $n$ and the noisy action label sequence $Y_n$. Based on these inputs, the decoder denoises $Y_n$ to generate the denoised sequence $Y_{n-1}$ at the previous step $n-1$. Iterating this process ultimately yields a prediction $P$ that is close to the original action label sequence. Finally, the predictions $P_{\mathrm{rgb}}$ and $P_{\mathrm{flow}}$ of the two streams are combined by a weighted sum to achieve late fusion:

$$P_{\mathrm{fuse}} = w_{\mathrm{rgb}}\, P_{\mathrm{rgb}} + w_{\mathrm{flow}}\, P_{\mathrm{flow}} \tag{1}$$

where $P_{\mathrm{fuse}}$ is the fused prediction, $P_{\mathrm{rgb}}$ and $P_{\mathrm{flow}}$ are the outputs of the RGB stream and the optical flow stream, respectively, and $w_{\mathrm{rgb}}$ and $w_{\mathrm{flow}}$ are the stream weights, with values between 0 and 1. We set these weights to different values depending on the dataset.

Our choice of a weighted late-fusion strategy over more complex early- or mid-fusion modules is fundamentally structural. Since RGB and optical flow have inherently different feature distributions and temporal dynamics, coupling them early in the iterative diffusion denoising process causes gradient interference and degrades the learning of modality-specific priors. As validated in our ablation studies in Section 4.3, allowing two independent diffusion streams to fully recover their sequence priors before late fusion yields the most robust multi-modal synergy. Finally, post-processing strategies such as purge, median filtering, or mode filtering are applied to the fused prediction to further refine the result.
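To make the late-fusion and post-processing steps concrete, the following is a minimal sketch in plain Python, not the implementation used in this work. It assumes each stream outputs a $T \times C$ matrix of class scores stored as a list of per-frame rows; the weights 0.6 and 0.4, the window size, and all function names are illustrative choices. Only mode filtering is shown; the purge and median-filter variants are not sketched.

```python
from collections import Counter


def fuse_predictions(p_rgb, p_flow, w_rgb, w_flow):
    """Weighted late fusion of two T x C score matrices (lists of rows)."""
    if len(p_rgb) != len(p_flow):
        raise ValueError("both streams must cover the same number of frames")
    return [
        [w_rgb * a + w_flow * b for a, b in zip(row_rgb, row_flow)]
        for row_rgb, row_flow in zip(p_rgb, p_flow)
    ]


def argmax_labels(scores):
    """Pick the highest-scoring class index for every frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def mode_filter(labels, window=3):
    """Replace each label by the most frequent label in a centred window.

    Ties resolve to the label that appears first inside the window.
    """
    half = window // 2
    smoothed = []
    for t in range(len(labels)):
        lo, hi = max(0, t - half), min(len(labels), t + half + 1)
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed


# Toy example: 7 frames, 2 classes, illustrative weights.
p_rgb = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.7, 0.3],
         [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
p_flow = [[0.8, 0.2], [0.7, 0.3], [0.45, 0.55], [0.9, 0.1],
          [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]]

fused = fuse_predictions(p_rgb, p_flow, w_rgb=0.6, w_flow=0.4)
labels = argmax_labels(fused)      # frame 2 is an isolated spike of class 1
smoothed = mode_filter(labels, 3)  # the spike is absorbed by its neighbours
```

In this toy run the fused labels contain a one-frame segment of class 1 at frame 2, which the mode filter removes while leaving the genuine transition near the end untouched.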
Feature extraction. We use the pre-trained I3D [15] model to extract I3D FLOW features, where the feature dimension $d$ is set to 1024. For RGB feature extraction on the GTEA and 50Salads datasets, we replace the standard I3D RGB features with those generated by the pre-trained image encoder of the Br-Prompt [14] framework. Regarding the prompting details, the original Br-Prompt framework aligns video clips with a "three-plus-one-level" textual prompt structure via contrastive learning. This structure dynamically generates text templates that combine statistical, ordinal, semantic, and integrated prompts to capture temporal logic. Note that in our 2s-DAS framework we neither manually design nor fine-tune these textual prompts during training. Instead, we directly extract the 768-dimensional ($d = 768$) frame-wise RGB features generated by the frozen Br-Prompt vision encoder. The extracted features thus inherit the ordinal and semantic contextual awareness learned during Br-Prompt's pre-training, without any additional prompting overhead in our pipeline.
3.2. Diffusion Module
When applied to action segmentation, diffusion models consist of two phases: the forward process and the reverse process. Since the RGB and optical flow streams employ an identical diffusion architecture, we present the formulation for a single generic stream and omit the stream-specific subscripts for clarity. To bridge the gap between the continuous diffusion paradigm and the discrete categorical label space, we first represent the action label sequence in a continuous logit space $Y_0 \in \mathbb{R}^{C \times T}$ via one-hot encoding, where $C$ is the number of action categories and $T$ is the sequence length. This ensures that the model can perform iterative refinement within a stable probability space. Under this formulation, the forward process gradually adds Gaussian noise to the continuous representation $Y_0$, making it increasingly random until it becomes a pure noise sequence $Y_N$ that has lost almost all of the original action information. This diffusion process is described by Equation (2):

$$Y_n = \sqrt{\bar{\alpha}_n}\, Y_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon \tag{2}$$

where $Y_0$ is the original action label sequence, $\epsilon$ is random Gaussian noise that simulates the uncertainty and ambiguity in the action sequence at step $n$, and $\bar{\alpha}_n$ controls the degree of noise added at each step. The reverse process starts from the pure noise sequence $Y_N$ and recovers the original action labels $Y_0$ by gradually removing noise. The denoising at each step $n$ is described by Equation (3):

$$Y_{n-1} = \sqrt{\bar{\alpha}_{n-1}}\, P_n + \sqrt{1 - \bar{\alpha}_{n-1} - \sigma_n^2} \cdot \frac{Y_n - \sqrt{\bar{\alpha}_n}\, P_n}{\sqrt{1 - \bar{\alpha}_n}} + \sigma_n\, \epsilon \tag{3}$$

where $P_n$ is the denoised sequence, i.e., the estimate of the original sequence $Y_0$ obtained by feeding the noisy sequence $Y_n$ into the decoder at step $n$. The variance parameter $\sigma_n$ controls the stochasticity of the reverse trajectory. Equation (3) is applied iteratively until the estimate $\hat{Y}_0$ is obtained; an argmax operation along the class dimension then maps the continuous logits back to discrete categorical action labels.
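The forward corruption of Equation (2) and a single reverse update of Equation (3) can be sketched as follows. This is an illustrative plain-Python example under stated assumptions, not the implementation of this work: the cosine noise schedule is an assumption (the schedule is not specified here), sequences are stored as $T$ rows of $C$ values for readability, and the decoder is replaced by an oracle that returns the true $Y_0$ so that the arithmetic can be checked.

```python
import math
import random


def cosine_alpha_bar(n, total_steps, s=0.008):
    """Cumulative signal level at step n (cosine schedule; an assumption)."""
    def f(k):
        return math.cos((k / total_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(n) / f(0)


def one_hot(labels, num_classes):
    """Encode frame labels as rows of 0.0/1.0 values."""
    return [[1.0 if c == lab else 0.0 for c in range(num_classes)]
            for lab in labels]


def forward_noise(y0, a_n, rng):
    """Equation (2): Y_n = sqrt(a_n) * Y_0 + sqrt(1 - a_n) * eps."""
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in y0]
    y_n = [[math.sqrt(a_n) * y + math.sqrt(1.0 - a_n) * e
            for y, e in zip(y_row, e_row)]
           for y_row, e_row in zip(y0, eps)]
    return y_n, eps


def ddim_step(y_n, p_n, a_n, a_prev, sigma, rng):
    """Equation (3): step from n to n-1 given the decoder estimate P_n."""
    y_prev = []
    for y_row, p_row in zip(y_n, p_n):
        row = []
        for y, p in zip(y_row, p_row):
            # Noise implied by the current sample and the estimate of Y_0.
            eps_hat = (y - math.sqrt(a_n) * p) / math.sqrt(1.0 - a_n)
            direction = math.sqrt(1.0 - a_prev - sigma ** 2) * eps_hat
            row.append(math.sqrt(a_prev) * p + direction
                       + sigma * rng.gauss(0.0, 1.0))
        y_prev.append(row)
    return y_prev


def to_labels(y):
    """Argmax along the class dimension."""
    return [max(range(len(row)), key=row.__getitem__) for row in y]


rng = random.Random(0)
labels = [0, 0, 1, 2]
y0 = one_hot(labels, num_classes=3)
a_n = cosine_alpha_bar(500, 1000)
a_prev = cosine_alpha_bar(475, 1000)
y_n, eps = forward_noise(y0, a_n, rng)

# Oracle estimate P_n = Y_0, deterministic update (sigma = 0).
y_prev = ddim_step(y_n, y0, a_n, a_prev, sigma=0.0, rng=rng)
# Stepping all the way to a signal level of 1 returns the estimate itself.
recovered = ddim_step(y_n, y0, a_n, a_prev=1.0, sigma=0.0, rng=rng)
```

With $\sigma_n = 0$ and a perfect estimate, the update lands exactly on the forward trajectory at the earlier step, which is why the deterministic variant can skip steps at inference.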
During training, we randomly select a diffusion step $n \in \{1, \dots, N\}$ for each batch and apply Gaussian noise $\epsilon$ to the ground-truth label sequence $Y_0$, which yields the noisy input $Y_n$. The decoder is then trained to estimate the original sequence $Y_0$ from this corrupted version. During inference, the model starts from a fully noisy sequence $Y_N$ and iteratively refines it with the DDIM denoising procedure, progressively generating cleaner outputs until the final prediction $P$ is obtained. The importance sampling strategy is used exclusively during training. By selectively masking input features, it encourages the framework to capture the prior distribution of human actions and to maintain structural consistency under information constraints. During inference, the strategy is deactivated so that the model can use the complete input features, thereby achieving better boundary refinement and segmentation accuracy. The decoder therefore denoises the pure noise sequence $Y_N$ conditioned solely on the features $F$ generated by the encoder. The training procedure of our model is as follows:

$$P_n = g_{\psi}\left(Y_n,\, n,\, h_{\phi}(X) \odot M\right) \tag{4}$$

where $P_n$ is the estimated original sequence at step $n$, $g_{\psi}$ denotes the decoder, $h_{\phi}$ represents the encoder applied to the input features $X$, $\odot$ indicates element-wise multiplication, and $M$ is the mask generated by the sampling strategy in the importance sampling module.
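The data flow of one training-time forward pass (noise the labels, encode the features, mask them, decode) can be sketched as below. This is only a schematic under stated assumptions: the encoder, decoder, noising function, and mask generator are hypothetical stand-in callables rather than real networks, and the mask is assumed to act per frame.

```python
import random


def apply_frame_mask(features, mask):
    """Element-wise masking with a per-frame mask: masked frames become zeros."""
    return [[value * keep for value in frame]
            for frame, keep in zip(features, mask)]


def training_forward(x, y0, encoder, decoder, add_noise, make_mask,
                     total_steps, rng):
    """One forward pass: decoder(Y_n, n, encoder(X) masked by M)."""
    n = rng.randint(1, total_steps)        # one diffusion step per batch
    y_n = add_noise(y0, n, rng)            # corrupt the ground-truth labels
    features = encoder(x)                  # higher-level features
    mask = make_mask(len(features), rng)   # chosen by importance sampling
    p_n = decoder(y_n, n, apply_frame_mask(features, mask))
    return n, mask, p_n


# Toy stand-ins that only expose the data flow (not real networks).
def identity_encoder(x):
    return x


def echo_decoder(y_n, n, condition):
    return condition                       # returns its conditioning unchanged


def no_noise(y0, n, rng):
    return y0


def drop_middle(length, rng):
    return [0 if t == length // 2 else 1 for t in range(length)]


rng = random.Random(0)
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y0 = [[1, 0], [1, 0], [0, 1]]
n, mask, p_n = training_forward(x, y0, identity_encoder, echo_decoder,
                                no_noise, drop_middle,
                                total_steps=1000, rng=rng)
```

Because the stand-in decoder echoes its conditioning input, the returned `p_n` shows directly which frames the decoder would have been allowed to see.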
Encoder. The diffusion module consists of an encoder, a decoder, and an importance sampling block. The encoder is composed of $L$ encoder blocks, each built around a dilated convolution, a dilated attention mechanism, and a feed-forward layer, adapted from ASFormer [6]. The input features are first processed by the dilated convolution to extract local features, and then by the dilated attention to model the global context. Specifically, in the dilated attention, the receptive field of the $i$-th block is restricted to a local window of size $2^i$, so that temporal modeling progresses from local to global as the dilation rate grows with depth. Within each block, the features then pass through the feed-forward network, and the block output is added to the block input via a residual connection before being passed to the next encoder block. The output of the last block is mapped into the class space to produce the initial action segmentation prediction.
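As a rough illustration of how a window that doubles with depth moves from local to global context, the sketch below lists the window size per block and builds a boolean attention mask for a given window. It assumes the window of block $i$ is $2^i$ with $i$ starting at 1, and it uses a simple sliding window centred on each frame; the reference ASFormer implementation may partition the sequence differently, so this is a simplification and the function names are hypothetical.

```python
def window_sizes(num_blocks):
    """Assumed window size 2**i for encoder blocks i = 1..num_blocks."""
    return [2 ** i for i in range(1, num_blocks + 1)]


def local_attention_mask(seq_len, window):
    """mask[t][s] is True when frame t may attend to frame s.

    Each frame sees the frames at most window // 2 positions away.
    """
    half = window // 2
    return [[abs(t - s) <= half for s in range(seq_len)]
            for t in range(seq_len)]


sizes = window_sizes(10)            # ten blocks, as in Section 3.4
mask = local_attention_mask(5, 2)   # first block on a 5-frame toy sequence
```

With ten blocks the last window spans 1024 frames, so the deepest block can relate frames that are far apart while the first blocks stay strictly local.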
Decoder. The decoder uses convolutional layers and attention mechanisms similar to those of the encoder. It differs in two respects: it requires a transformation of the diffusion time step, which is an additional input, and it employs cross-attention within the attention module to relate its multiple inputs. The decoder receives the current features and the noisy action sequence, and iteratively generates a precise action sequence via the reverse denoising process defined in Equation (3).
Importance sampling. Before feeding the refined features $F$ into the decoder, we use an importance sampling strategy to select the information passed to the decoder, thereby guiding the model to learn the action prior distribution and predict results. During training, we follow the method in [9] and randomly select among three strategies: full sampling, boundary sampling, and random action sampling. Zero sampling is listed below for completeness but is excluded, because it relies solely on prior modeling: it predicts from the initial positions of the video frames, cannot dynamically adjust actions during decoding, and does not align with the goals of our video action segmentation task. Specifically:
(a) Full sampling. Inputs all features into the decoder, helping the model directly learn a discriminative mapping from features to action classes.
(b) Zero sampling. Does not pass any features, directly using the initial action positions for prediction (not applicable to action segmentation tasks).
(c) Boundary sampling. Removes features near action boundaries, forcing the model to rely on contextual information to refine boundary predictions.
(d) Random action sampling. Randomly removes the features of a specific action class, enabling the model to infer the missing regions from the contextual information of other action classes.

The design of our importance sampling module is rooted in the need to provide the decoder with diverse contextual perturbations, thereby compelling the model to learn a robust generative prior of action sequences. While specific strategies such as boundary sampling emphasize the refinement of action transitions, relying exclusively on a single structured sampling method can lead to localized over-fitting. We therefore propose a Random Selection strategy that dynamically chooses among the defined sampling methods at each training iteration. This approach serves as a form of stochastic regularization (analogous to Dropout), preventing the model from becoming overly dependent on specific temporal cues. By encountering a wide variety of masked and unmasked feature combinations, the decoder is forced to model the global temporal logic and holistic action patterns. As demonstrated in our ablation study (Section 4.3), this randomized strategy outperforms every individual structured sampling method, improving the model's ability to handle both boundary ambiguity and long-range sequence alignment.
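The three strategies that are actually used, together with the random selection among them, can be sketched as per-frame binary masks (1 keeps a frame's features, 0 removes them). This is a minimal illustration, not the implementation of this work: the boundary radius, the uniform choice among strategies, and all function names are assumptions. The masks are derived from ground-truth labels, which is only possible during training.

```python
import random


def full_mask(labels):
    """Full sampling: keep the features of every frame."""
    return [1] * len(labels)


def boundary_mask(labels, radius=1):
    """Boundary sampling: drop frames within `radius` of a label change."""
    mask = [1] * len(labels)
    for t in range(1, len(labels)):
        if labels[t] != labels[t - 1]:
            for k in range(max(0, t - radius), min(len(labels), t + radius)):
                mask[k] = 0
    return mask


def random_action_mask(labels, rng):
    """Random action sampling: drop every frame of one random action class."""
    dropped = rng.choice(sorted(set(labels)))
    return [0 if lab == dropped else 1 for lab in labels]


def sample_mask(labels, rng):
    """Random selection: pick one strategy per training iteration."""
    strategy = rng.choice(["full", "boundary", "random_action"])
    if strategy == "full":
        return strategy, full_mask(labels)
    if strategy == "boundary":
        return strategy, boundary_mask(labels)
    return strategy, random_action_mask(labels, rng)


labels = [0, 0, 0, 1, 1, 1, 2, 2]
b_mask = boundary_mask(labels)                      # frames 2, 3, 5, 6 removed
a_mask = random_action_mask(labels, random.Random(0))
chosen, mask = sample_mask(labels, random.Random(1))
```

At inference time none of these masks is applied, which corresponds to always using the full mask.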
3.3. Loss Function
We compare the denoised prediction sequence $P$ obtained from the decoder with the original action labels $Y_0$ to calculate the corresponding loss values. The loss function in this paper consists of three parts: the cross-entropy classification loss $\mathcal{L}_{ce}$, the smoothing loss $\mathcal{L}_{smo}$ [5], and the boundary alignment loss $\mathcal{L}_{bd}$.
The cross-entropy classification loss $\mathcal{L}_{ce}$ minimizes the negative log-likelihood, as shown in Equation (5):

$$\mathcal{L}_{ce} = -\frac{1}{TC} \sum_{t=1}^{T} \sum_{c=1}^{C} y_{t,c} \log p_{t,c} \tag{5}$$

where $t$ denotes the video sequence time step index, $c$ denotes the action class, $y_{t,c}$ represents the original ground-truth label of the $t$-th frame for class $c$, and $p_{t,c}$ is the corresponding predicted probability obtained by applying a Softmax function over the predicted logits. The smoothing loss $\mathcal{L}_{smo}$ promotes the temporal smoothness of the model's output by calculating the mean squared error between the frame-wise log-probabilities of adjacent frames, as shown in Equation (6):

$$\mathcal{L}_{smo} = \frac{1}{(T-1)C} \sum_{t=1}^{T-1} \sum_{c=1}^{C} \left(\log p_{t,c} - \log p_{t+1,c}\right)^2 \tag{6}$$
The boundary alignment loss $\mathcal{L}_{bd}$ ensures that the boundaries detected in the denoised sequence $P$ align with the boundaries in the true sequence $Y_0$. We use a binary cross-entropy loss between the ground-truth boundary sequence derived from $Y_0$ and the predicted boundary probabilities derived from the denoised sequence $P$. The final loss function is composed as shown in Equation (7):

$$\mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{smo} + \mathcal{L}_{bd} \tag{7}$$
For 2s-DAS, the loss function of each stream takes the form of $\mathcal{L}$; we denote the RGB stream loss by $\mathcal{L}_{\mathrm{rgb}}$ and the optical flow stream loss by $\mathcal{L}_{\mathrm{flow}}$. To optimize both streams simultaneously during training, we combine the loss functions of the two streams. This enables the streams to learn from each other through gradient propagation and reduces the complexity of training, improving stability and the convergence rate. The total loss function of the model $\mathcal{L}_{\mathrm{total}}$ is therefore as follows:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{rgb}} + \mathcal{L}_{\mathrm{flow}} \tag{8}$$
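The three loss terms and their combination over the two streams can be sketched in plain Python as follows. Several details are assumptions made for illustration only: the normalisation constants, the use of log-probabilities in the smoothing term, the unweighted sums, and in particular the predicted boundary probability, taken here as one minus the dot product of the class distributions of adjacent frames. The function names are hypothetical.

```python
import math

EPS = 1e-8  # guards the logarithms against zero probabilities


def softmax(logits):
    """Numerically stable softmax over one frame's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cross_entropy(probs, labels):
    """Negative log-likelihood of the true class, averaged over T * C."""
    num_classes = len(probs[0])
    nll = -sum(math.log(p[y] + EPS) for p, y in zip(probs, labels))
    return nll / (len(labels) * num_classes)


def smoothing_loss(probs):
    """Mean squared difference of log-probabilities of adjacent frames."""
    T, C = len(probs), len(probs[0])
    total = 0.0
    for t in range(T - 1):
        for c in range(C):
            d = math.log(probs[t][c] + EPS) - math.log(probs[t + 1][c] + EPS)
            total += d * d
    return total / ((T - 1) * C)


def boundary_loss(probs, labels):
    """Binary cross-entropy between true and predicted boundaries."""
    T = len(labels)
    total = 0.0
    for t in range(T - 1):
        target = 1.0 if labels[t] != labels[t + 1] else 0.0
        # Assumed boundary probability: adjacent frames disagree on the class.
        same = sum(a * b for a, b in zip(probs[t], probs[t + 1]))
        pred = min(max(1.0 - same, EPS), 1.0 - EPS)
        total -= target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred)
    return total / (T - 1)


def stream_loss(logits, labels):
    """Sum of the three terms for one stream (unweighted; an assumption)."""
    probs = [softmax(row) for row in logits]
    return (cross_entropy(probs, labels) + smoothing_loss(probs)
            + boundary_loss(probs, labels))


labels = [0, 0, 1]
rgb_logits = [[5.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
flow_logits = [[4.0, 1.0, 0.0], [3.0, 1.0, 0.0], [0.0, 4.0, 1.0]]

rgb_probs = [softmax(row) for row in rgb_logits]
ce = cross_entropy(rgb_probs, labels)
smo = smoothing_loss(rgb_probs)
bd = boundary_loss(rgb_probs, labels)
total = stream_loss(rgb_logits, labels) + stream_loss(flow_logits, labels)
```

Note that the smoothing term penalises the genuine transition between frames 1 and 2 as well; this tension with the boundary term is inherent to combining the two objectives.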
3.4. Implementation and Training Details
The overall structure of the 2s-DAS model includes two encoders and two decoders. Both the encoders and the decoders are built on the ASFormer [6] model; by reducing iterative calculations, the decoders in this paper significantly reduce computational costs. Each encoder consists of 10 encoder blocks, with an input dimension of 64 for the 50Salads and GTEA datasets and 256 for the Breakfast dataset. Each decoder includes 8 decoder blocks, with dimensions of 24 for the 50Salads and GTEA datasets and 128 for the Breakfast dataset. The convolution kernel size for the GTEA and Breakfast datasets is 5, while for the 50Salads dataset the convolution kernel size in the decoder is set to 7.

Regarding the training protocols, both the Br-Prompt and I3D feature extractors are pre-trained on Kinetics-400 and kept strictly frozen during the training of our framework. The spatial resolution of the input video frames is set to 224 × 224, and features are extracted offline to ensure computational efficiency. For optimization, we employ the Adam optimizer with a uniform batch size of 4 across all experiments. The learning rate is 0.0005 for the 50Salads and GTEA datasets and 0.0001 for the Breakfast dataset. The total number of training epochs is set to 10,000 for the GTEA dataset, 5,000 for 50Salads, and 1,000 for the Breakfast dataset. The fusion weights $w_{\mathrm{rgb}}$ and $w_{\mathrm{flow}}$ are determined through grid search on the validation set of each fold. We implemented the 2s-DAS structure using PyTorch 1.10, and all experiments were conducted on NVIDIA Tesla V100.

Similar to [9], the total number of diffusion steps is set to 1000. During inference, we adopt the accelerated sampling strategy with a step size of 25 [30]. The multi-modal late-fusion weights and other hyperparameters are determined through an experimental search exclusively on the training folds. No information from the test folds is used during model selection or hyperparameter tuning.
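The accelerated inference schedule can be illustrated with a short sketch. It reads "a step size of 25" literally as a stride between visited diffusion steps, which is an assumption; if the value instead denotes the number of sampling steps, the same function with a stride of 40 yields 25 updates. The function name and the use of -1 to mark the final clean output are illustrative conventions.

```python
def ddim_schedule(total_steps=1000, stride=25):
    """Pairs (n, n_prev) visited at inference, from noisy to clean.

    Steps are indexed 0..total_steps-1; n_prev = -1 marks the final
    clean output after the last update.
    """
    steps = list(range(total_steps - 1, -1, -stride))
    return list(zip(steps, steps[1:] + [-1]))


schedule = ddim_schedule(1000, 25)       # stride reading: 40 decoder calls
alternative = ddim_schedule(1000, 40)    # step-count reading: 25 decoder calls
```

Under the stride reading the decoder is called 40 times per video instead of 1000, which is where the inference speed-up comes from.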