Article

Multi-Task Seq2Seq Framework for Highway Incident Duration Prediction Incorporating Response Steps and Time Offsets

1 College of Transportation Engineering, Chang’an University, Xi’an 710061, China
2 Ministry of Education Engineering Center for Transportation Infrastructure Digitization, Chang’an University, Xi’an 710061, China
* Author to whom correspondence should be addressed.
Vehicles 2026, 8(1), 5; https://doi.org/10.3390/vehicles8010005
Submission received: 13 December 2025 / Revised: 26 December 2025 / Accepted: 28 December 2025 / Published: 2 January 2026

Abstract

Highway traffic incident management is a dynamic and time-dependent process, and rapidly and accurately predicting its complete sequence of actions and corresponding time schedule is essential for refined and intelligent traffic control. To address the limitations of existing studies that predominantly focus on predicting the total duration while lacking fine-grained modeling of the response procedure, this study proposed a multi-task sequence-to-sequence (Seq2Seq) framework based on a BERT encoder and Transformer decoder to jointly predict incident response steps and their associated time offsets. The model first leveraged a pretrained BERT to encode the incident type and alarm description text, followed by an autoregressive Transformer decoder that generated a sequence of response actions. An action-aware temporal prediction module was incorporated to predict the time offset of each step in parallel, and an adaptive weighted multi-task loss was adopted to optimize both action classification and time regression tasks. Experiments based on 4128 real records of highway incident handling in Yunnan Province demonstrated that the proposed model outperformed baseline approaches in duration prediction, achieving an RMSE of 18.05, MAE of 14.69, MAPE of 37.13%, MedAE of 13.23, and SMAPE of 33.55%. In addition, the model attained BLEU-4 and ROUGE-L scores of 62.33% and 82.04% in procedure text generation, confirming its capability to effectively learn procedural logic and temporal patterns from textual data and offering an interpretable decision-support approach for traffic incident duration prediction. The findings of this study could further support intelligent traffic management systems by enhancing incident response planning, real-time control strategies, and resource allocation for expressway operations.

1. Introduction

Highway traffic incidents constitute a major factor affecting traffic efficiency and triggering safety risks on expressways. Their occurrence and subsequent management often lead to partial or complete blockage of lane resources, significantly reducing sectional capacity and disturbing local traffic flow conditions [1,2]. The duration of an incident is a key variable determining the scale and severity of its impact. As the response process prolongs, the traffic delay induced by an incident grows nonlinearly, potentially causing large-scale congestion, regional paralysis, and considerable economic and safety consequences [2,3]. Beyond temporal indicators, road traffic incident maps can also help contextualize incident impacts by visualizing where disruptions concentrate, which complements duration-oriented analyses [4]. Accurate prediction of incident duration has therefore become a central task in intelligent transportation systems and emergency response. Reliable estimates allow traffic management agencies to rapidly formulate scientific and pragmatic control strategies, improving network operation efficiency and reducing disruption [2,5].
From a problem-definition perspective, traffic incident duration prediction can be organized by the operational target that the prediction is intended to support: (i) one-shot estimation of the total incident duration for early-stage planning, (ii) dynamic forecasting of the remaining duration with continuous updates as the incident evolves and new information arrives, and (iii) phase-/process-level prediction for decision support so that outputs align with operational actions and management stages. Methodologically, these targets have been approached using statistical/survival modeling and data-driven machine learning/deep learning, with recent studies increasingly emphasizing dynamic updating and process-aware (hybrid) traffic incident management frameworks. Systematic surveys and reviews have summarized the evolution of these paradigms, their assumptions, and their practical trade-offs [2,6].
For one-shot estimation of the total incident duration, prior studies have predominantly relied on structured operational data and can be broadly divided into regression/classification approaches and survival (hazard-based) modeling. Regression and classification methods treat duration as a continuous variable or discretized time intervals, enabling flexible learning of nonlinear relationships among incident attributes, traffic states, and environmental conditions. Representative examples include a graph convolutional network that models spatial interactions among sensors for short-/long-duration classification [7], an integrated classification–regression framework that addresses data deficiency via clustering and relabeling [8], and an attention-based long short-term memory (LSTM) that captures temporal dependencies and feature relevance for duration prediction [9], as well as related deep learning predictors developed for highway traffic accidents [5]. In addition, classic tree-based models have been used to derive decision rules for incident duration prediction, providing interpretable relationships without strong parametric assumptions [10]. In parallel, survival analysis provides an interpretable statistical foundation for right-skewed duration distributions and (potentially) censored or interval-censored data. For instance, Hojati et al. established accelerated failure time (AFT) models with heterogeneity adjustments for three incident categories [11]. Tirtha et al. employed a copula-based framework to jointly model the relationship between incident types and durations through SMNL and GGOL models [12]. Gu et al. formulated a geographically weighted proportional hazards model to estimate the total impact time including recovery period [13]. Although these one-shot paradigms are widely adopted and often effective for early-stage planning, their predictive performance and practical utility can be constrained by the inherent limitations of structured data in representing incident evolution and operational context.
Beyond one-shot estimation, dynamic forecasting focuses on predicting the remaining duration and updating it as new information becomes available during incident evolution. This target is particularly relevant for real-time traffic operations because incident progression, response actions, and clearance conditions may change substantially after the initial report. Dynamic approaches typically incorporate time-varying covariates (e.g., evolving traffic flow states, updated incident status, and intermediate response records) and repeatedly estimate the residual time-to-clearance at successive update moments. Zhu et al. proposed a deep learning framework (LSTM/MLP) to dynamically predict residual incident duration on urban expressways using incident-related factors and real-time traffic flow parameters [14]. Related studies on online or real-time post-impact prediction also indicate the value of continuously updated spatiotemporal modeling under streaming operational data [15,16]. More recent work on dynamic event-duration forecasting further supports the feasibility and necessity of real-time data integration and updating mechanisms [17]. However, many dynamic predictors still output a single scalar remaining time at each update, which may be insufficient for fine-grained interpretability and step-wise decision support in practical incident management.
For decision support, a third stream emphasizes phase- or process-level duration prediction, aiming to align model outputs with operational stages in traffic incident management rather than treating the incident as a monolithic event. In this line, incident duration is often conceptualized as comprising multiple operational phases (e.g., detection, response, clearance, and recovery), and phase-based modeling attempts to explain or predict phase durations to support targeted dispatching and control strategies [2,18]. Empirical studies have also examined the contribution of specific sub-processes such as discovery time to overall duration formation, providing evidence that intermediate operational steps can meaningfully impact the total duration [19]. Moreover, traffic incident management (TIM) is widely recognized as a coordinated multi-disciplinary process to detect, respond to, and clear incidents in a safe and timely manner, motivating hybrid management frameworks that couple multiple operational stages for actionable decision support [20]. Despite these practical motivations, existing phase-/process-oriented studies commonly remain at coarse granularity (macro phases) and rarely exploit unstructured textual narratives as a primary signal for generating actionable, fine-grained process outputs.
In real-world traffic operations, critical information—including incident evolution, response decisions, and status changes—is often recorded as unstructured natural language text, which structured features fail to represent comprehensively and promptly [21]. Owing to the flexibility and richness of natural language, recent research has begun to explore the value of text-based information in duration prediction. Ji et al. developed a V-Fisher clustering-based ensemble approach to address text heterogeneity in duration classification [21]. Chen et al. constructed a Word2Vec–BiLSTM–CNN multimodal network for regression prediction of unforeseen incident duration based on textual descriptions [22].
Despite these advances, two significant challenges remain. First, existing studies lack fine-grained process-level modeling. Most research treats the final duration as a single regression target or only divides the procedure into a few predefined macro stages [15,16,19], which are too coarse to capture specific response steps and their corresponding time offsets. This constraint limits the practical interpretability and decision-support value for traffic management. Second, current text-based approaches often separate semantic extraction and duration regression as independent processes, hindering deep joint optimization between semantic understanding and temporal prediction.
To address these limitations, this study proposes a BERT encoder–Transformer decoder framework that leverages the dynamic semantic representation capability of BERT and the sequence modeling advantages of Transformer. Conceptually, our approach targets text-driven, fine-grained process-level modeling: it generates a sequence of standardized response steps and simultaneously estimates step-specific time offsets, enabling a unified view of “what happened” and “how long each step took” for operational decision support. A response-step-aware time prediction module is designed to enable cross-task interaction based on action indices and collaborative modeling of semantic information and time offsets. A multi-task learning mechanism is further introduced to jointly optimize response step prediction and time regression. The main contributions of this work are summarized as follows:
(1) An end-to-end multi-task Seq2Seq model is proposed to jointly predict response steps and time offsets for multiple categories of highway incidents.
(2) A pretrained BERT model is utilized to deeply encode incident report text, enabling semantic transfer and enhancement.
(3) A Transformer decoder-based multi-task architecture is designed to collaboratively optimize action prediction and time regression, improving incident duration prediction performance.
The remainder of this paper is organized as follows. Section 2 introduces the dataset, preprocessing methods, and the proposed multi-task Seq2Seq model. Section 3 presents the experimental setup, performance evaluation, and discussion of the results. Section 4 summarizes the conclusions and highlights the main findings, while Section 5 discusses limitations and Section 6 outlines directions for future research.

2. Materials and Methods

2.1. Data Description

The raw dataset used in this study was obtained from the internal emergency incident-handling records of the Yunnan Provincial Transportation Investment and Construction Group. The data cover the period from 1 January 2024 to 19 August 2024 and include multiple categories of highway traffic incidents such as traffic accidents, congestion events, and mechanical vehicle failures. Each record contains three textual components: incident type (subtypename), incident alert description (alertdescribe), and incident response report (ownerreport). Examples of the original unstructured textual records are shown in Table 1.
Highway traffic incidents are typically categorized into four phases: incident detection, response, clearance, and traffic restoration [1,23]. The duration of an incident in this study is defined as the total time interval from detection to full restoration of normal traffic flow, which reflects the overall impact of the incident on traffic operations and serves as a key indicator of emergency response efficiency [24].
Traffic incident duration, however, is not a static numerical value; rather, it evolves as a dynamic sequential process composed of a series of ordered response steps [24]. As illustrated by the examples in Table 1, event progression is neither random nor uncorrelated; it is driven by semantic cues embedded in the alert text, such as temporal attributes (“20:30”), spatial context (“tunnel”), incident type (“truck rollover”), and severity descriptors (“occupying overtaking lane”). These contextual elements constitute the core observable variables that guide prediction tasks in this study.
From the initial alert to final traffic restoration, an incident typically goes through multiple response steps, including reporting, department coordination, rescue deployment, on-site handling, and road reopening. The time consumed by each step contributes to the final duration, forming the fine-grained internal temporal structure of incident management. In this regard, total duration represents a macro-level outcome, whereas step-wise duration reflects its micro-level procedural composition, which is the focus of this study.
During real-world data collection, substantial variations exist in the frequency distribution across incident types, and the naturally recorded text may contain noise in the form of incomplete descriptions, typographical errors, and missing fields, posing risks to model robustness and predictive reliability [25]. To mitigate these issues, this work focuses on the three most frequently occurring incident types—traffic accidents, mechanical failures, and congestion events—which collectively account for approximately 87% of all recorded cases.
In addition, to reduce the influence of extreme outliers and concentrate on scenarios with stronger representativeness, events with durations between 10 and 120 min were retained for analysis. This subset covers approximately 75% of the original dataset, providing a more concentrated distribution conducive to learning dominant temporal patterns.
After applying data cleaning and filtering procedures, a total of 4128 valid incident samples were retained, including 3072 traffic accidents (74.42%), 574 mechanical failures (13.91%), and 482 congestion events (11.68%). This dataset reflects the typical characteristics of incident occurrences in the studied highway network. The distribution of incident durations is illustrated in Figure 1.

2.2. Text Data Preprocessing

The process of traffic incident detection and response follows an explicit temporal sequence, thus requiring the establishment of a unified time reference. Regular expressions were applied to extract time information separately from the incident alert descriptions and response reports [26]. Specifically, the timestamp contained in the alert description is defined as the incident discovery time and is used as the baseline. In our raw records, time points are recorded in Chinese in the form “HH [hour-marker] MM [minute-marker]” (equivalent to “HH:MM”). Therefore, we extracted hour–minute pairs using a rule-based regular-expression matcher tailored to this format. Samples without a valid baseline time extracted from the incident alert description were removed from subsequent processing.
Based on the same time pattern, the response report is further converted into a chronological sequence of “time–action” pairs, as illustrated in Figure 2. We first identify all explicit time expressions in the response report (i.e., timestamps in the “HH [hour-marker] MM [minute-marker]” format; hereafter referred to as time anchors) together with their character positions. The report is then segmented by adjacent time anchors: for each time anchor, the action text is defined as the substring between the end of the current time anchor and the start of the next one (or the end of the report for the last anchor). If there exists descriptive text before the first time anchor, it is preserved as an initial step with a time offset of 0 min to avoid losing leading narrative information. Each time anchor is converted into an absolute time object (using a fixed dummy date to compute differences), and the time offset (in minutes) is calculated relative to the baseline discovery time. To handle potential cross-day cases caused by hour–minute recording, negative offsets are corrected by adding 1440 min (24 h). Action texts are then normalized by removing a potential leading comma and merging consecutive whitespace; empty action segments are ignored.
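To make the extraction concrete, the following is a minimal sketch of the procedure described above, assuming the Chinese “时/分” hour/minute markers; the helper names (parse_report, to_minutes) are illustrative rather than the project’s actual code.

```python
import re
from typing import List, Tuple

# Matches Chinese-style "HH时MM分" time points, e.g., "20时30分" (= 20:30).
TIME_ANCHOR = re.compile(r"(\d{1,2})时(\d{1,2})分")

def to_minutes(hh: str, mm: str) -> int:
    """Convert an hour-minute pair to minutes since midnight."""
    return int(hh) * 60 + int(mm)

def parse_report(report: str, baseline_min: int) -> List[Tuple[int, str]]:
    """Split a response report into (time offset in minutes, action text) pairs."""
    steps: List[Tuple[int, str]] = []
    anchors = list(TIME_ANCHOR.finditer(report))
    # Descriptive text before the first anchor becomes an initial step at offset 0.
    if anchors and anchors[0].start() > 0:
        lead = report[: anchors[0].start()].strip(" ，,")
        if lead:
            steps.append((0, lead))
    for i, m in enumerate(anchors):
        end = anchors[i + 1].start() if i + 1 < len(anchors) else len(report)
        action = re.sub(r"\s+", " ", report[m.end():end].strip(" ，,"))
        if not action:                       # ignore empty action segments
            continue
        offset = to_minutes(m.group(1), m.group(2)) - baseline_min
        if offset < 0:                       # cross-day case: add 24 h (1440 min)
            offset += 1440
        steps.append((offset, action))
    return steps
```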
To further improve text quality and reduce the influence of low-frequency noise tokens on numerical representation and model training, all textual fields—including incident type, alert description, and response actions—were cleaned in a unified manner. Chinese word segmentation was performed using the Jieba tokenizer [27], followed by the construction of a global vocabulary with token frequency statistics. A minimum frequency threshold of two occurrences was applied to filter out low-frequency noise terms. After the preprocessing steps, a global vocabulary consisting of 3249 tokens was generated, providing a clearer semantic space for subsequent modeling and prediction tasks. Notably, this vocabulary supports preprocessing only, whereas text encoding for model training is introduced in Section 2.3.
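A minimal sketch of this tokenization-and-filtering step follows, using Jieba’s default dictionary; build_vocab is an illustrative helper.

```python
import jieba
from collections import Counter

def build_vocab(texts, min_freq=2):
    """Tokenize with Jieba and keep tokens occurring at least min_freq times."""
    counts = Counter()
    for text in texts:
        counts.update(jieba.lcut(text))   # Chinese word segmentation
    return {tok for tok, c in counts.items() if c >= min_freq}

# Applied jointly to incident types, alert descriptions, and response actions,
# this procedure yields the 3249-token preprocessing vocabulary reported above.
```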

2.3. Multi-Task Seq2Seq Model Architecture

Different categories of highway traffic incidents impose distinct impacts on traffic operations and consequently require different response procedures. Natural language text, owing to its flexibility, is commonly used to document key information regarding incident detection and the subsequent response process [28]. Leveraging deep learning to model the mapping between incident descriptions and the corresponding response steps enables the development of predictive models capable of forecasting both the procedural sequence and its associated temporal structure. This study proposes a multi-task learning model based on a BERT encoder and Transformer decoder to capture the relationships among response steps, incident type, alert description, and the duration of each step.

2.3.1. BERT Encoder

To extract semantic cues relevant to incident response decision-making from the incident type and alert description, a pretrained BERT model is employed as the text encoder [29,30]. Compared with static word embeddings such as Word2Vec and GloVe, BERT generates contextualized representations dynamically and is therefore more effective in capturing key semantic attributes—such as event category, spatial context, temporal information, and severity level—from the alert text. These semantic cues are essential to infer the appropriate response actions and their corresponding time offsets [30,31].
As illustrated in Figure 3, BERT consists of three embedding layers (token, segment, and position embeddings), multiple Transformer encoder layers, and an output layer. The input to the model is the concatenated text sequence of the incident type and alert description. The BERT tokenizer performs subword segmentation and numerical encoding to generate Token IDs, Segment IDs, and Attention Masks. The token, segment, and positional embeddings are summed and fed into stacked encoder layers. Through bidirectional self-attention, the encoder aggregates global semantic information and outputs a sequence of contextualized hidden vectors that represent the semantic structure of the original alert text.
Compared with general natural language processing (NLP) corpora, incident-related text contains heterogeneous information—such as event morphology, spatial scene, and evolving status—that directly influence response actions and duration. The semantic vectors produced by the BERT encoder unify these elements in a continuous representation space and serve as semantic input for both decoder-based action generation and time offset prediction, enabling the model to fully exploit textual information in downstream tasks.
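To illustrate the encoding interface, a minimal sketch using the Hugging Face transformers API is given below; the bert-base-chinese checkpoint and the example sentence are our assumptions, as the text only specifies a pretrained BERT.

```python
import torch
from transformers import BertTokenizer, BertModel

# Checkpoint is an assumption; the paper only states "pretrained BERT".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

# Input: incident type concatenated with the alert description (illustrative;
# roughly "traffic accident [SEP] 20:30 truck rollover in tunnel, occupying
# the overtaking lane").
text = "交通事故 [SEP] 20时30分隧道内货车侧翻，占用超车道"
batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    out = encoder(**batch)   # token/segment/position embeddings + self-attention

h_enc = out.last_hidden_state   # (1, seq_len, 768) contextualized token vectors
```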

2.3.2. Transformer Decoder

The incident-handling procedure can be viewed as a sequence of actions occurring over time with both short- and long-range dependencies. To support autoregressive generation and align each predicted action with global semantic context, a Transformer decoder based on self-attention is adopted as the sequence generator [32,33]. The decoder models historical dependencies through masked self-attention and aligns the current generation step with the encoder’s semantic representations through cross-attention. Compared with RNN- or CNN-based architectures, the Transformer is more advantageous in parallel computation and long-range dependency modeling, making it well-suited for multi-step prediction tasks [33].
As shown in Figure 4, the decoder consists of stacked layers composed of three functional submodules:
  • Masked multi-head self-attention explicitly captures temporal dependencies among previously generated actions, enabling inference of subsequent steps;
  • Multi-head cross-attention aligns action prediction with key semantic features extracted by the BERT encoder, establishing a mapping between textual semantics and response decisions;
  • A position-wise feedforward network applies nonlinear transformation to extract features relevant to the current action and provide stable representations for both action embedding and time offset prediction.
Residual connections and layer normalization are applied to each sublayer to enhance training stability and convergence [34].
During action generation, the decoder receives the previously generated action token at each time step and computes the conditional probability distribution for the next action through masked attention and cross-layer alignment, until the end-of-sequence token is produced. This mechanism enables the decoder to leverage both the historical action chain and the textual semantics to generate a response sequence that aligns with realistic operational logic. The generated action representations are simultaneously used as input to the response-step-aware time prediction module, enabling collaborative modeling between action generation and time regression (see Section 2.3.3).
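A minimal PyTorch sketch of this decoder stack is shown below; the layer depth and head count are placeholders (the actual hyperparameters are listed in Table 2), and the sinusoidal positional encodings described in Section 2.3.3 are omitted for brevity.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 768, 2804   # step vocabulary incl. <pad>/<sos>/<eos>/<unk>

# One decoder layer = masked self-attention + cross-attention + feedforward,
# each sublayer wrapped with residual connections and layer normalization.
layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=4)     # depth/heads assumed
step_embed = nn.Embedding(vocab_size, d_model)
step_head = nn.Linear(d_model, vocab_size)               # Softmax classification space

def decode_next(prev_tokens: torch.Tensor, h_enc: torch.Tensor):
    """Score the next response step given the steps generated so far.

    prev_tokens: (B, T_t) step indices; h_enc: (B, L_enc, d_model) BERT output.
    """
    tgt = step_embed(prev_tokens)
    causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
    h_dec = decoder(tgt, memory=h_enc, tgt_mask=causal)  # masked self- + cross-attn
    return step_head(h_dec[:, -1]), h_dec                # next-step logits, states
```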

2.3.3. Multi-Task Learning for Action Prediction and Time Offset Regression

The proposed multi-task prediction model is built upon an encoder–decoder framework and aims to jointly predict the sequence of response steps and their corresponding time offsets. It is designed to process incident text data and, in an end-to-end fashion, output both the response procedure and the execution time of each step [35]. Unlike many existing multi-task models that design independent decoders for different tasks or perform task-specific prediction only on top of a shared encoder [36,37], the proposed model combines the semantic representation capability of a pretrained language model with the sequence generation advantages of the Transformer architecture. A response-step-aware mechanism is further introduced, where response step indices act as a bridge for cross-task information interaction, enabling the collaborative optimization of discrete step classification and continuous time regression. The overall architecture is shown in Figure 5 and consists of five core modules: a BERT tokenizer, a BERT encoder, a Transformer decoder, a response-step-aware time prediction module, and an adaptive multi-task learning module.
(1) Input Encoding and Layer-wise Fine-tuning
The model input is a natural language sequence obtained by concatenating the incident type and the alert description. A pretrained BERT tokenizer is first used to convert the text into Token IDs and attention masks, ensuring compatibility with the BERT encoder input format. Different from the Jieba-based segmentation in Section 2.2, which is only used for word frequency statistics and text cleaning, the BERT tokenizer is the only text encoding method employed during model training and inference. The tokenized sequence is then fed into the BERT encoder (see Section 2.3.1) to extract deep contextual semantic representations.
To maintain training stability under limited data and to make effective use of pretrained knowledge, a layer-wise progressive fine-tuning strategy was adopted: the parameters of the lower BERT layers were frozen, and only the last two layers were fine-tuned. Moreover, different learning rates were assigned to BERT and the remaining modules (BERT: 2 × 10−5; others: 1 × 10−4) to balance knowledge retention and task adaptation. The contextual vector of each token output by the encoder provides global contextual information for the subsequent generation of response steps.
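The freezing and discriminative learning-rate scheme can be sketched as follows, assuming a Hugging Face BertModel; other_modules is a stand-in for the decoder and prediction heads.

```python
import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-chinese")   # checkpoint assumed

# Freeze all BERT parameters, then unfreeze only the last two encoder layers.
for p in bert.parameters():
    p.requires_grad = False
for layer in bert.encoder.layer[-2:]:
    for p in layer.parameters():
        p.requires_grad = True

# Discriminative learning rates: 2e-5 for BERT, 1e-4 for the remaining modules.
other_modules = torch.nn.Linear(768, 768)   # placeholder for decoder/heads
optimizer = torch.optim.AdamW([
    {"params": [p for p in bert.parameters() if p.requires_grad], "lr": 2e-5},
    {"params": other_modules.parameters(), "lr": 1e-4},
])
```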
(2) Autoregressive Generation of Response Steps
The sequence of response steps is generated by an autoregressive Transformer decoder (see Section 2.3.2). To clearly define the prediction space of the decoder, all unique response step descriptions in the training set were deduplicated to construct a response step vocabulary of size $V = 2804$. Before deduplication, we applied deterministic surface-form normalization to reduce trivial textual variants. This process is fully data-driven and does not involve manual or semantic clustering of step expressions; therefore, each vocabulary item corresponds to a normalized surface form observed in the training data. This vocabulary consists of the unique response step phrases and four special tokens: <pad>, <sos>, <eos>, and <unk>. Here, <pad> is used for sequence padding and is ignored in the loss computation.
Different from the BERT tokenizer and the Jieba-based preprocessing vocabulary, this response step vocabulary defines the Softmax classification space of the decoder output layer and directly participates in the response step prediction loss (Equation (6)). At each decoding time step, the decoder takes the previously generated step token as input and, together with the contextual information from the BERT encoder, predicts the probability distribution over the next response step category. During training, a linearly decayed teacher forcing strategy (from 0.5 to 0.1) was adopted to gradually reduce the probability of feeding the ground-truth step as the next input, which helped accelerate convergence and stabilize gradient propagation.
In addition, sinusoidal positional encodings are added after the input embedding layer of the decoder, enabling the model to explicitly perceive the order of response steps in the sequence. Conditioned on the contextual vectors from the BERT encoder, the Transformer decoder produces dual outputs at each time step to support multi-task learning: it outputs both (i) the probability distribution over response steps for decision-making at the current step and (ii) a step index embedding that serves as a query signal for the response-step-aware time prediction module.
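A small sketch of the linearly decayed teacher forcing schedule (0.5 → 0.1) described above is given below; the function names are illustrative.

```python
import random

def teacher_forcing_ratio(epoch: int, num_epochs: int,
                          start: float = 0.5, end: float = 0.1) -> float:
    """Linearly decay the probability of feeding the ground-truth step."""
    frac = epoch / max(num_epochs - 1, 1)
    return start + (end - start) * frac

def next_decoder_input(gt_token, pred_token, ratio: float):
    """With probability `ratio`, feed ground truth; otherwise feed the prediction."""
    return gt_token if random.random() < ratio else pred_token
```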
(3) Response-Step-Aware Time Offset Prediction
In parallel with response step generation, the model introduces a response-step-aware time prediction module at each decoding step to estimate the time offset of the currently predicted step. Different from traditional multi-task frameworks that model each task independently [38], this module uses the decoder-produced embedding of the current response action index $e_t^{(act)}$ as a query to achieve cross-task information interaction, thereby dynamically coupling semantic and temporal information.
The module adopts a dual-attention mechanism to extract information from a joint context matrix $H_{full}$, which is constructed by concatenating the BERT encoder outputs and the decoder hidden states up to the current step. The attention operations are defined as:

$$c_t^{(act)} = \mathrm{Attention}\big(Q = e_t^{(act)},\; K = H_{full},\; V = H_{full}\big) \tag{1}$$

$$c_t^{(step)} = \mathrm{Attention}\big(Q = e_t^{(step)},\; K = H_{full},\; V = H_{full}\big) \tag{2}$$

where $e_t^{(act)}$ denotes the embedding of the predicted response step at time $t$, and $e_t^{(step)}$ denotes the positional embedding at time $t$. The matrix $H_{full} \in \mathbb{R}^{(L_{enc}+T_t) \times d_{model}}$ represents the joint encoder–decoder context and is obtained by concatenating, along the sequence dimension, the encoder output $H_{enc} \in \mathbb{R}^{L_{enc} \times d_{model}}$ and the decoder hidden-state sequence up to time $t$, $H_{dec}^{t} \in \mathbb{R}^{T_t \times d_{model}}$. Here, $L_{enc}$ is the length of the input text sequence, $T_t$ is the length of the response step sequence generated so far (including the start token), and $d_{model}$ is the feature dimension of the attention layer.
The attention outputs are concatenated with the original embeddings:

$$V_{concat} = \big[\, c_t^{(act)};\; c_t^{(step)};\; e_t^{(act)};\; e_t^{(step)} \,\big] \tag{3}$$

and passed through a feature fusion layer:

$$h_{fusion} = \mathrm{ReLU}\big(W_{fusion} \cdot V_{concat} + b_{fusion}\big) \tag{4}$$

followed by a multilayer perceptron for time offset prediction:

$$y_t = \mathrm{MLP}_{time}\big(h_{fusion}\big) \tag{5}$$

where $[\,\cdot\,;\,\cdot\,]$ denotes vector concatenation; $W_{fusion} \in \mathbb{R}^{d_{hidden} \times 4d_{model}}$ and $b_{fusion} \in \mathbb{R}^{d_{hidden}}$ are the weight matrix and bias vector of the fusion layer, learned during training; $d_{hidden}$ is the hidden feature dimension; and $\mathrm{MLP}_{time}$ denotes the multilayer perceptron used for time regression (its detailed configuration is given in Section 3.1).
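The following PyTorch sketch mirrors Equations (1)–(5), assuming multi-head attention as the Attention operator; the widths of the three-layer $\mathrm{MLP}_{time}$ are approximated here, since the surrounding text only fixes $d_{hidden} = 512$ (the exact structure is given in Table 3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepAwareTimeHead(nn.Module):
    """Response-step-aware time offset prediction, following Equations (1)-(5)."""

    def __init__(self, d_model: int = 768, d_hidden: int = 512):
        super().__init__()
        self.attn_act = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.attn_step = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.fusion = nn.Linear(4 * d_model, d_hidden)        # W_fusion, b_fusion
        self.mlp_time = nn.Sequential(                        # three-layer MLP_time;
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),         # widths are assumptions
            nn.Linear(d_hidden, d_hidden // 2), nn.ReLU(),
            nn.Linear(d_hidden // 2, 1),
        )

    def forward(self, e_act, e_step, h_enc, h_dec):
        # e_act, e_step: (B, 1, d_model) query embeddings for the current step.
        # Joint context H_full: encoder outputs + decoder states up to step t.
        h_full = torch.cat([h_enc, h_dec], dim=1)             # (B, L_enc+T_t, d)
        c_act, _ = self.attn_act(e_act, h_full, h_full)       # Eq. (1)
        c_step, _ = self.attn_step(e_step, h_full, h_full)    # Eq. (2)
        v = torch.cat([c_act, c_step, e_act, e_step], dim=-1) # Eq. (3)
        h_fus = F.relu(self.fusion(v))                        # Eq. (4)
        return self.mlp_time(h_fus).squeeze(-1)               # Eq. (5): y_t
```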
To alleviate the adverse effects of heterogeneous time scales on training stability, the time labels were transformed using $\mathrm{log1p}(\cdot)$ and standardized (subtracting the mean and dividing by the standard deviation) before training. During inference, the predicted values are first de-standardized and then transformed back to real time values via the exponential function, thereby recovering the actual time offsets.
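A minimal sketch of this label transformation and its inverse; mean and std are the log1p-scale statistics computed on the training set.

```python
import numpy as np

def encode_time(y_minutes, mean, std):
    """log1p transform, then z-score standardization (training-set statistics)."""
    return (np.log1p(y_minutes) - mean) / std

def decode_time(y_pred, mean, std):
    """Invert the standardization, then expm1 to recover minutes."""
    return np.expm1(y_pred * std + mean)
```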
(4) Adaptive Multi-Task Optimization
To jointly optimize response step prediction and time offset regression, an adaptive weighted multi-task learning module is constructed. At each time step, the model computes a classification loss $\mathcal{L}_{act}$ for response step prediction and a regression loss $\mathcal{L}_{time}$ for time offset estimation. Cross-entropy loss $\mathcal{L}_{CE}$ is used for step prediction, while Smooth L1 (Huber) loss $\mathcal{L}_{Huber}$ is used for time prediction. Learnable loss weights are introduced to dynamically balance the two tasks. For each sample, the losses at all valid positions of the response step sequence were computed and aggregated with weights for backpropagation. To prevent gradient explosion, gradient norm clipping was applied to all parameters, including the loss weights. The loss functions are defined as:
$$\mathcal{L}_{act} = \frac{1}{N} \sum_{t=1}^{N} \mathcal{L}_{CE}\big(P_t,\; y_t^{(act)}\big) \tag{6}$$

$$\mathcal{L}_{time} = \frac{1}{N} \sum_{t=1}^{N} \mathcal{L}_{Huber}\big(y_t,\; y_t^{(time)}\big) \tag{7}$$

$$\mathcal{L}_{total} = \lambda_{act} \mathcal{L}_{act} + \lambda_{time} \mathcal{L}_{time} \tag{8}$$

where $N$ is the length of the valid response step sequence; $P_t \in \mathbb{R}^{V}$ is the predicted probability distribution over response step categories at time step $t$ (with $V$ the size of the response step vocabulary); $y_t$ is the standardized time prediction given by Equation (5); $y_t^{(act)}$ and $y_t^{(time)}$ are the ground-truth step index and standardized time label at time step $t$, respectively; and $\lambda_{act}$ and $\lambda_{time}$ are trainable loss weights.
The gradients of $\mathcal{L}_{total}$ are backpropagated through all modules. The time regression gradients directly optimize the response-step-aware time prediction module and, via the decoder, also influence the encoder; the step classification gradients mainly optimize the decoder and encoder. This joint training process encourages the encoder to learn shared representations that are simultaneously informative for discrete response step prediction and continuous time offset regression, thus achieving end-to-end multi-task optimization.
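A sketch of the adaptive loss in Equations (6)–(8) follows; parameterizing the learnable weights in log space to keep them positive is our assumption, as the text only states that the weights are trainable and clipped together with the other parameters.

```python
import torch
import torch.nn as nn

class AdaptiveMultiTaskLoss(nn.Module):
    """Weighted sum of step classification and time regression losses (Eqs. (6)-(8))."""

    def __init__(self, pad_id: int):
        super().__init__()
        self.log_w = nn.Parameter(torch.zeros(2))            # lambda_act, lambda_time
        self.ce = nn.CrossEntropyLoss(ignore_index=pad_id)   # <pad> ignored
        self.huber = nn.SmoothL1Loss(reduction="none")

    def forward(self, step_logits, step_targets, time_pred, time_targets, mask):
        # step_logits: (B, T, V); step_targets: (B, T); mask: (B, T) in {0, 1}.
        l_act = self.ce(step_logits.transpose(1, 2), step_targets)
        l_time = (self.huber(time_pred, time_targets) * mask).sum() / mask.sum()
        w = self.log_w.exp()                                 # positive task weights
        return w[0] * l_act + w[1] * l_time

# After total_loss.backward(), clip all gradients, including the loss weights:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```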
(5) Inference and Output Reconstruction
During inference, a concatenated text input is tokenized and encoded to generate global contextual representations. The Transformer decoder autoregressively generates the action sequence, feeding the predicted action index simultaneously into the time prediction module. The predicted normalized time offsets are converted back to absolute timestamps and accumulated sequentially until the end token is produced, yielding a structured tuple sequence representing the event response procedure.
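A minimal sketch of the reconstruction step, reading each predicted offset as minutes relative to the baseline discovery time (consistent with the preprocessing in Section 2.2); the example values are illustrative.

```python
from datetime import datetime, timedelta

def reconstruct(base_time: datetime, steps, offsets):
    """Convert predicted (step, offset-in-minutes) pairs into absolute timestamps."""
    timeline = []
    for step, off in zip(steps, offsets):
        if step == "<eos>":                  # stop at the end-of-sequence token
            break
        timeline.append((base_time + timedelta(minutes=float(off)), step))
    return timeline

# Example: alarm received at 20:30; offsets are model outputs after inversion.
timeline = reconstruct(datetime(2024, 3, 1, 20, 30),
                       ["notify coordinated departments", "arrival at the scene", "<eos>"],
                       [4.2, 18.7, 0.0])
```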

3. Results and Discussion

The experiments were conducted using real-world highway traffic incident records collected from the Yunnan provincial expressway network. The dataset contains 4128 incident records, covering three major categories: traffic accidents, congestion events, and vehicle mechanical failures. To ensure consistent class distribution across the training, validation, and test sets, stratified sampling was applied, dividing the dataset at a ratio of 70% for training, 10% for validation, and 20% for testing based on incident category.

3.1. Experimental Environment and Configuration

The experiments were carried out under the following hardware and software environment:
  • CPU: Intel(R) Xeon(R) Silver 4215R @ 3.20 GHz.
  • GPU: NVIDIA GeForce RTX 3090.
  • Programming Language: Python 3.12.
  • Deep Learning Framework: PyTorch v2.5.
  • Transformers Library (Hugging Face): v4.46.
The key hyperparameters used in model training are summarized in Table 2.
Regarding the response-step-aware time prediction module, the $\mathrm{MLP}_{time}$ was implemented as a three-layer multilayer perceptron (MLP). The detailed structure is shown in Table 3. All attention layers use a feature dimension of $d_{model} = 768$, and the internal hidden layer dimension of the time prediction module was set to $d_{hidden} = 512$.

3.2. Performance Evaluation

To evaluate the effectiveness of the proposed model in highway incident duration prediction, four categories of baseline models were selected for comparative analysis. These baselines represent mainstream techniques in current text-driven prediction tasks and can be categorized into two groups based on prediction paradigm:
  • Total duration prediction paradigm: Large language model fine-tuning, pretrained regression models, and temporal gated models. These models directly learn the mapping between input text and the overall incident duration.
  • Process-level prediction paradigm (ablation baseline): A standard Seq2Seq model designed to verify the effectiveness of the proposed module components.
The baselines were constructed as follows:
  • Low-rank adaptation (LoRA)-tuned large language model (LLM): Based on GLM4-9B-Chat, the model was fine-tuned using a corpus constructed from incident type and alert description to learn the correspondence between response steps and incident context.
  • BERT-based regression model: A pretrained BERT encoder converted the concatenated incident type and alert description into a dense vector representation, followed by a regression layer to estimate incident duration.
  • BERT + GRU model: Textual features extracted by BERT were combined with time offset sequences and fed into a GRU temporal module to capture sequential dependencies and generate the final regression value through a fully connected layer.
  • Standard Seq2Seq model (ablation baseline): This model retained the same BERT encoder and Transformer decoder architecture as the proposed model but removed the response-step-aware time prediction module. The decoder directly regressed time offsets using hidden states at each step, without explicitly modeling the attention-based interaction between action indices and contextual representations.
All baselines were trained using the same input text and optimization configurations. The comparison of time prediction performance and response step generation performance across models on the test set is shown in Table 4.
(1) Time Prediction Performance
The results reveal distinct performance differences across models in the core task of time prediction. The LoRA-tuned LLM performs significantly worse than the other models in numerical regression, reflecting that general-purpose language models, while capable of semantic-level text generation, lack sensitivity to precise numerical outputs without explicit numerical modeling components. As LLMs are optimized on discrete token probability distributions rather than continuous value spaces, the model tends to preserve semantic coherence rather than approximate numerical accuracy, even after fine-tuning [39,40]. The notably high MAPE further reflects the sensitivity of percentage-based errors to short-duration cases, where small ground-truth values can amplify relative deviations [41].
The BERT-based regression model performs better than the LLM but still exhibits considerable error due to compressing complex incident descriptions into a single static vector, ignoring the decomposed structure and cumulative nature of response procedures. The BERT + GRU model achieves the best performance among total-duration prediction models in terms of RMSE, MAE, and SMAPE, indicating that introducing temporal structures helps capture implicit time dependencies in textual descriptions [14,42]. Meanwhile, the lower MedAE and MAPE of the BERT-based regression model suggest that it performs comparatively well on typical cases and yields lower percentage errors, whereas the BERT + GRU model reduces overall errors more effectively, implying that modeling temporal structure is particularly beneficial for mitigating larger deviations. However, the BERT + GRU model still takes the entire text as a holistic feature input and does not parse fine-grained factors influencing individual step durations, resulting in residual prediction deviations.
The standard Seq2Seq model performs slightly better than the BERT-based regression model in terms of RMSE and MAE, but remains worse than BERT + GRU. This may be attributed to the decoder hidden state simultaneously supporting step generation and time regression, causing interference between semantic generation and numerical prediction [43,44]. Without a dedicated module for explicit temporal modeling, shared representations struggle to balance “accurate response step generation” and “precise time offset regression,” which is also reflected by its relatively higher SMAPE.
In contrast, the proposed multi-task prediction model achieved the best results across all time prediction metrics (RMSE = 18.05, MAE = 14.69, MAPE = 37.13%, MedAE = 13.23, SMAPE = 33.55%). Compared with the standard Seq2Seq baseline, the introduction of the response-step-aware time prediction module explicitly encodes the relationship between response steps and time offsets, partially decoupling time regression from shared representations [45]. This enables finer-grained temporal modeling, validating the effectiveness of this module and demonstrating that the model learns step-level duration variations and leverages inter-step dependencies to improve prediction accuracy. In addition, to verify the robustness and statistical reliability of the observed performance gains, we conducted 5-fold stratified cross-validation and paired bootstrap significance analysis; the detailed protocols and results are reported in Appendix A (Table A1 and Table A2), which further support the conclusions in this section.
(2) Response Step Generation Performance
In response step generation, the LoRA-tuned LLM, the standard Seq2Seq baseline, and the proposed multi-task model all exhibit high-quality generation performance (BLEU-4 above 60% and ROUGE-L above 80%). In addition, to assess step- and sequence-level correctness beyond n-gram similarity, we performed step-level alignment between prediction and ground truth using dynamic programming sequence alignment with similarity scores. The alignment-based Step Accuracy and Sequence-level Accuracy, along with the detailed alignment procedure, are reported in Appendix B (Table A3). This indicates that incorporating the response-step-aware mechanism does not weaken the model’s ability to learn procedural logic.
The case study in Table 5 illustrates this result. For a rear-end collision scenario, the proposed model generates a logically coherent standardized response chain—from “notify coordinated departments” to “arrival at the scene,” “towing operation,” and “traffic restoration.” However, the model predicts “closing overtaking lane” instead of the correct “closing driving lane”, indicating that the model still lacks nuanced learning at the rule-level granularity, particularly in mapping fine-grained incident attributes to lane management decisions.
Overall, while minor deviations remain at the detail level, the proposed model maintains strong performance in both time prediction accuracy and response step generation quality. The paradigm of jointly predicting procedural steps and their associated time offsets transforms traditional “black-box duration regression” into a more interpretable process-oriented prediction, providing structured and transparent decision support for highway incident management.

3.3. Discussion

The results in Table 4 suggest that modeling incident duration as a process rather than a single scalar can yield both improved accuracy and better operational interpretability. Prior reviews have emphasized that incident duration prediction is challenging due to heterogeneous influencing factors, unobserved operational dynamics, and limited interpretability of purely “black-box” predictors, even when predictive performance is acceptable. These issues motivate approaches that align model outputs with incident management stages and actionable procedures. In this context, the performance of the proposed model indicates that jointly learning “what actions are taken” and “when they occur” provides a useful inductive bias for duration inference, consistent with the broader literature that sequential/temporal structures (e.g., RNN/GRU/LSTM-based designs) can enhance incident duration forecasting compared with static text-to-duration regression. By contrast, the relatively weak regression accuracy of the LoRA-tuned LLM baseline is consistent with findings that standard LLM decoding and training objectives are fundamentally token-based and therefore lack a native inductive bias for precise continuous-value regression, especially when numerical magnitudes must be represented through discrete tokens and learned via cross-entropy. The improvement over the standard Seq2Seq ablation further suggests that explicitly conditioning time estimation on response-step representations helps mitigate the well-known issue of task interference in multi-task settings and makes the regression target more “step-aware,” which is aligned with established multi-task learning evidence that careful task coupling and loss balancing are important for avoiding negative transfer. Overall, these observations support the conclusion that process-level, step-conditioned modeling can simultaneously improve predictive accuracy and provide structured outputs that are more directly usable for traffic incident management.

3.4. Performance Comparison Across Incident Types

Considering the significant class imbalance inherent in real-world highway incident data (as discussed in Section 2.1, traffic accident samples account for 74.42% of the total), the test set was further divided into three subsets corresponding to each incident category to evaluate the generalization capability and robustness of the proposed model across different scenarios. The prediction errors for each incident type are summarized in Table 6.
The results show varying performance across incident categories. Regarding time prediction, the model performs best on mechanical failure incidents and worst on congestion events in terms of absolute-error metrics (RMSE, MAE, and MedAE). A possible explanation is that mechanical failures generally follow more standardized response procedures with relatively concentrated duration distributions, enabling the model to consistently learn the associated temporal patterns despite a smaller sample size. In contrast, congestion events are triggered by diverse causes and influenced by multiple factors—traffic volume, management strategies, and environmental conditions—leading to higher uncertainty in duration and making stable time offset prediction more difficult. Notably, although congestion events show the largest absolute errors, they exhibit relatively lower percentage errors (MAPE and SMAPE), which may be attributed to their typically longer durations; when ground-truth durations are larger, the same magnitude of absolute deviation corresponds to a smaller relative error.
For response step generation, the model performs best on congestion events while performing weaker on mechanical failures. This discrepancy may stem from the fact that congestion events often adhere to highly consistent procedural patterns, allowing the model to capture step sequences effectively even with limited data. However, response reports for mechanical failures contain more geographical and location-specific details, resulting in novel or rare expressions that fall outside the learned action vocabulary V. Consequently, predictions may contain a higher proportion of unknown action tokens (“<unk>”), which negatively affects sequence quality. Furthermore, the procedural uniformity in congestion events may partially explain the weaker performance in time offset prediction for this category; consistent step patterns may provide limited temporal variance to differentiate duration outcomes. This phenomenon is also consistent with the observation that congestion events remain challenging under absolute-error metrics (RMSE/MAE/MedAE), even though their percentage errors (MAPE/SMAPE) are comparatively smaller, which may be partly attributed to their typically longer durations.
Overall, although performance varies across event types, the proposed model maintains predictive capability even for minority classes, suggesting that the multi-task framework grounded in pretrained language models mitigates performance degradation under data imbalance and exhibits a level of domain transferability. The results also demonstrate that procedural predictability improves response step generation but does not necessarily imply temporal predictability. In other words, the presence of a consistent workflow does not guarantee sufficient time-related cues for duration inference, indicating a non-synchronized relationship between structural predictability and temporal predictability.

4. Conclusions

In practical highway networks, information related to traffic incidents is often incomplete or cannot be fully captured at the time of occurrence. To address this challenge, this study investigates the relationship between textual semantics, response procedures, and incident durations by using the incident type and alert description as model inputs, and proposes a BERT–Transformer-based multi-task prediction model. The model first performs autoregressive response step generation through a Transformer decoder and then estimates the time offset of each step via a response-step-aware time prediction module. A multi-task learning mechanism subsequently optimizes action prediction and time regression jointly. Experimental results demonstrate that the proposed model significantly outperformed LoRA-tuned LLMs, BERT-based regression models, and BERT + GRU models in overall duration prediction, while maintaining robust performance across different incident types under imbalanced data conditions. Compared with a standard Seq2Seq model, the introduction of the response-step-aware module substantially improves time prediction accuracy without degrading text generation quality, validating the effectiveness and scalability of this design for fine-grained temporal modeling.
This study provides a process-interpretable prediction paradigm with step-level inference capability for multi-category highway incident management, supporting the shift from traditional black-box duration regression toward transparent and explainable process-driven prediction. Such a framework offers potential practical value for traffic management decision-making, resource allocation, and emergency response. In particular, the experimental results provide quantitative evidence for these advantages. The proposed model achieves an RMSE of 18.05 min and an MAE of 14.69 min for duration prediction, while maintaining strong procedure-generation performance (BLEU-4 = 62.33% and ROUGE-L = 82.04%). These results indicate that step-conditioned temporal modeling can improve numerical accuracy without sacrificing procedural coherence. The evaluation across incident categories also suggests heterogeneous difficulty, with mechanical failures showing lower prediction errors and congestion events remaining more challenging due to higher operational uncertainty. In addition, the cross-validation and paired bootstrap analyses reported in the Appendix support the stability and statistical reliability of the observed improvements.

5. Limitations

Nevertheless, several limitations remain. First, the proposed model infers the response procedure solely from the initial alert text. However, response steps are inherently dynamic and may change with real-time traffic conditions, resource coordination, and operational feedback; the current framework performs one-shot prediction without an explicit online updating mechanism. Second, although alert descriptions contain heterogeneous cues (e.g., time, spatial context, and facility descriptors), the current model mainly relies on incident type and textual description and does not fully incorporate spatial/structural or strategic-level information. Third, the experiments were conducted on incident records from the Yunnan expressway network; differences in network topology and operational mechanisms may affect cross-regional transferability. Fourth, the dataset was filtered to three major incident types and durations within 10–120 min, which improves representativeness but may limit generalization to rare incident categories and extreme-duration cases. Finally, the naturally imbalanced distribution across incident types may lead to higher uncertainty in minority classes, and further validation under more diverse and balanced conditions is needed.

6. Future Work

Future research may integrate multi-source heterogeneous data and develop a dynamic prediction framework that incorporates textual semantics, road network structure, traffic operating conditions, real-time response feedback, and policy interventions, enabling online updating of remaining steps and incident duration and improving robustness under cross-regional and rare-event scenarios.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/vehicles8010005/s1. Table S1. Chinese original texts for the sample records reported in Table 1 of the main manuscript; Table S2. English translations of the Chinese original texts in Table S1 (corresponding to Table 1 in the main manuscript); Table S3. Chinese original texts for the case study reported in Table 5 of the main manuscript (ground truth and model-generated response process); Table S4. English translations of the Chinese original texts in Table S3 (corresponding to Table 5 in the main manuscript).

Author Contributions

Conceptualization, F.F. and X.F.; methodology, F.F. and X.F.; software, F.F. and J.H.; validation, F.F., J.H. and X.F.; formal analysis, F.F., J.H. and X.F.; data curation, F.F., J.H. and X.F.; writing—original draft preparation, F.F.; writing—review and editing, F.F. and X.F.; visualization, F.F.; supervision, F.F., J.H. and X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the use of anonymized and de-identified traffic management log records provided by the transportation authority, which contain no personally identifiable information.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are owned by the transportation management authority and are subject to internal confidentiality agreements; therefore, they are not publicly available. Data and code supporting the findings of this study may be available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Robustness and Statistical Significance Analyses

Appendix A.1. Rationale for Baseline and Metric Selection

To evaluate the robustness and statistical significance of the time prediction results, additional analyses are conducted beyond the hold-out setting. For a concise yet informative comparison, the robustness and significance analyses focus on the proposed model and two strong representative baselines, BERT + GRU and Standard Seq2Seq. These baselines are selected because they achieve competitive performance in the main experiments and represent widely adopted neural architectures for sequence modeling and regression.
For evaluation, we report RMSE, MAE, MAPE, MedAE, and SMAPE as the time prediction metrics, consistent with the main results section. RMSE reflects sensitivity to large errors, MAE measures the average absolute deviation, and MedAE provides a robust measure of typical error by reducing the influence of outliers. MAPE and SMAPE quantify relative error in a scale-normalized manner, with SMAPE being less sensitive to very small denominators. Together, these metrics provide a comprehensive assessment of both absolute and relative time prediction performance.

Appendix A.2. Stratified 5-Fold Cross-Validation Protocol

We performed 5-fold stratified cross-validation over the full dataset to evaluate the stability of the time prediction performance under different data partitions. In each fold, four folds were used for training and the remaining fold was used for testing. The stratification was conducted according to the event category labels to preserve the class proportions across folds.
To prevent information leakage, all data-dependent preprocessing statistics (e.g., vocabulary statistics derived from the training data) were computed using only the training portion of each fold and then applied unchanged to the corresponding validation/test data in that fold. Model hyperparameters and training settings followed the main experimental configuration.
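A minimal sketch of the fold protocol using scikit-learn; the stand-in “model” (a training-fold mean) is only meant to show that all data-dependent statistics are fit inside each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
y_cat = rng.choice(["accident", "mechanical", "congestion"],
                   size=4128, p=[0.744, 0.139, 0.117])   # class mix from Section 2.1
durations = rng.uniform(10, 120, size=4128)              # placeholder targets

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rmses = []
for train_idx, test_idx in skf.split(durations, y_cat):
    mean_pred = durations[train_idx].mean()   # fit on the training fold only
    err = durations[test_idx] - mean_pred
    rmses.append(np.sqrt(np.mean(err ** 2)))
print(f"RMSE: {np.mean(rmses):.2f} ± {np.std(rmses):.2f}")
```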
The cross-validation results are reported as mean ± standard deviation across the five folds in Table A1.
Table A1. 5-fold stratified cross-validation results (mean ± std).

| Model | RMSE | MAE | MAPE (%) | MedAE | SMAPE (%) |
|---|---|---|---|---|---|
| BERT + GRU | 23.16 ± 0.12 | 18.87 ± 0.31 | 51.05 ± 2.99 | 16.73 ± 0.81 | 38.75 ± 0.84 |
| Standard Seq2Seq | 24.55 ± 0.97 | 19.53 ± 0.79 | 48.56 ± 2.85 | 15.57 ± 0.90 | 52.24 ± 2.78 |
| Proposed Multi-Task Model | 18.09 ± 0.46 | 14.68 ± 0.31 | 37.18 ± 0.49 | 13.35 ± 0.27 | 34.15 ± 1.28 |

Appendix A.3. Paired Bootstrap Significance Analysis

To examine whether the observed improvements are statistically reliable, we conducted a paired bootstrap resampling analysis on the hold-out test set predictions. Specifically, we repeatedly sampled the test instances with replacement (B = 1000 resamples) and computed RMSE, MAE, MAPE, MedAE, and SMAPE for the proposed model and each baseline on the same resampled set, yielding an empirical distribution of the metric differences.
We report the mean performance gain (baseline minus proposed) and its 95% confidence interval (CI). A gain is considered statistically significant if the 95% CI does not include zero. The results are summarized in Table A2.
Table A2. Paired bootstrap significance test (B = 1000).
Comparison | ΔRMSE (Mean, 95% CI) | ΔMAE (Mean, 95% CI) | ΔMAPE (Mean, 95% CI) (%) | ΔMedAE (Mean, 95% CI) | ΔSMAPE (Mean, 95% CI) (%)
BERT + GRU vs. Proposed | 5.06 [3.15, 7.11] | 4.28 [2.46, 6.26] | 14.82 [11.68, 21.66] | 3.42 [0.89, 5.68] | 4.98 [1.86, 9.06]
Standard Seq2Seq vs. Proposed | 6.35 [5.01, 7.99] | 4.68 [3.39, 6.13] | 12.08 [8.67, 15.70] | 2.29 [0.88, 4.93] | 17.77 [14.05, 21.57]

Appendix B. Alignment-Based Step- and Sequence-Level Evaluation

Appendix B.1. Motivation

BLEU-4 and ROUGE-L quantify n-gram/sequence overlap at the text level, which is suitable for assessing overall generation quality. However, response-step generation produces an ordered list of procedural steps, where correctness also depends on whether the predicted steps can be aligned to the ground-truth steps in a consistent order. To complement BLEU/ROUGE with step- and sequence-level correctness, we introduce an alignment-based evaluation protocol and report Step Accuracy and Sequence-level Accuracy under multiple matching thresholds.

Appendix B.2. Similarity Function: ROUGE-L F1 at Step Level

Let the predicted step list be P = {p_1, …, p_T} and the ground-truth list be G = {g_1, …, g_{T′}}. We compute a step-level similarity score s(i, j) between each pair (p_i, g_j) using ROUGE-L F1, which is suitable for short step fragments:
$$s(i, j) = \text{ROUGE-L}_{F1}(p_i, g_j)$$
A predicted step and a ground-truth step are considered a match if s(i, j) ≥ τ, where τ is a threshold controlling the strictness of matching. In this appendix we report results for multiple τ values, including τ = 1.00, which corresponds to strict exact matching after normalization.
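For concreteness, a minimal sketch of this step-level similarity is shown below, computing ROUGE-L F1 from the longest common subsequence (LCS) of tokens. Character-level tokenization of normalized step strings is our assumption (a natural choice for Chinese step texts); the helper names are ours.

```python
# Minimal sketch of step-level ROUGE-L F1 via longest common subsequence (LCS).
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(pred_step, true_step):
    """ROUGE-L F1 between one predicted and one ground-truth step string."""
    p, g = list(pred_step), list(true_step)  # character-level tokens (our assumption)
    if not p or not g:
        return 0.0
    lcs = lcs_length(p, g)
    prec, rec = lcs / len(p), lcs / len(g)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
```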

Appendix B.3. Dynamic Programming Alignment

Direct position-wise comparison can be overly sensitive to step splitting/merging (e.g., one ground-truth step expressed as two predicted steps). Therefore, we perform global step-level alignment between P and G using dynamic programming sequence alignment to find an order-preserving alignment that maximizes total similarity:
  • Matching score between p_i and g_j: s(i, j).
  • Gap operation (unmatched predicted/ground-truth step): constant penalty γ (set to 0.1 in our experiments).
This yields an alignment path consisting of matched pairs (i, j) and gaps, providing a robust mapping between predicted and ground-truth step sequences while respecting procedural order.
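The sketch below illustrates one way to implement this alignment, using a Needleman–Wunsch-style recurrence with the constant gap penalty γ. It reuses the rouge_l_f1 helper sketched above and is a schematic rendering under those assumptions, not the released implementation.

```python
# Sketch of the order-preserving global alignment: maximize total similarity
# with a constant gap penalty gamma (0.1 in our experiments).
def align_steps(pred, true, gamma=0.1):
    """Return the set of aligned (i, j) index pairs between two step lists."""
    m, n = len(pred), len(true)
    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = -gamma * i                 # leading gaps on the predicted side
    for j in range(1, n + 1):
        score[0][j] = -gamma * j                 # leading gaps on the ground-truth side
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + rouge_l_f1(pred[i - 1], true[j - 1]),  # match
                score[i - 1][j] - gamma,         # skip a predicted step (gap)
                score[i][j - 1] - gamma)         # skip a ground-truth step (gap)
    # Trace back to recover the alignment path A.
    A, i, j = set(), m, n
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + rouge_l_f1(pred[i - 1], true[j - 1]):
            A.add((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] - gamma:
            i -= 1
        else:
            j -= 1
    return A
```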

Appendix B.4. Metrics Derived from Alignment

Based on the alignment result and the threshold τ, we compute:
(1) Step Accuracy (SA).
We define step-level correctness as the coverage of ground-truth steps by aligned predicted steps:
$$SA = \frac{\lvert \{\, g_j \mid \exists\, p_i : (i, j) \in A,\ s(i, j) \ge \tau \,\} \rvert}{T'}$$
where A denotes the set of aligned index pairs and T′ is the number of ground-truth steps. This formulation is robust to step splitting: a ground-truth step is considered covered if at least one aligned predicted step matches it above τ.
(2) Step Precision (SP) and Step F1.
To penalize over-generation or redundant splitting, we also compute precision on the predicted side:
$$SP = \frac{\lvert \{\, p_i \mid \exists\, g_j : (i, j) \in A,\ s(i, j) \ge \tau \,\} \rvert}{T}$$
$$F_1 = \frac{2 \cdot SA \cdot SP}{SA + SP}$$
(3) Strict Sequence-level Accuracy (SeqAcc).
To reflect exact procedural reproduction, we define strict sequence-level accuracy as 1 only when the entire predicted step sequence matches the ground truth under a one-to-one, gap-free alignment with equal length and all aligned pairs satisfying the threshold:
$$SeqAcc = \begin{cases} 1, & T = T' \ \text{and}\ s(i, i) \ge \tau \ \forall\, i, \\ 0, & \text{otherwise}. \end{cases}$$
This strict metric is intentionally conservative and highlights cases where the model reproduces the complete sequence without step splitting/merging.
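Under the same assumptions as the sketches above, these metrics can be derived from an alignment as follows; alignment_metrics is our hypothetical helper, and non-empty step lists are assumed.

```python
# Sketch computing SA, SP, F1, and strict SeqAcc from an alignment A at
# threshold tau, following the definitions above (helper names are ours).
def alignment_metrics(pred, true, A, tau):
    """Return (SA, SP, F1, SeqAcc) for one predicted/ground-truth step pair."""
    matched = {(i, j) for (i, j) in A if rouge_l_f1(pred[i], true[j]) >= tau}
    sa = len({j for (_, j) in matched}) / len(true)  # coverage of ground-truth steps
    sp = len({i for (i, _) in matched}) / len(pred)  # precision on the predicted side
    f1 = 0.0 if sa + sp == 0 else 2 * sa * sp / (sa + sp)
    # Strict: equal length, gap-free one-to-one alignment, all pairs above tau.
    seq_acc = int(len(pred) == len(true)
                  and all(rouge_l_f1(p, g) >= tau for p, g in zip(pred, true)))
    return sa, sp, f1, seq_acc
```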

Appendix B.5. Results and Discussion

Table A3 reports the alignment-based metrics for the proposed multi-task model and the standard Seq2Seq baseline across several thresholds τ, with gap penalty γ = 0.1. As expected, increasing τ makes matching stricter and lowers SA/SP/F1/SeqAcc, whereas lower thresholds allow partial overlaps to count as matches.
(1) Overall trends.
For both models, SA and SP increase as τ decreases from 1.0 to 0.7. For the proposed model, SA rises from 0.551 (τ = 1.00) to 0.747 (τ = 0.70), and SeqAcc rises from 0.112 to 0.243. Similarly, the Seq2Seq baseline improves from SA 0.555 to 0.741, and SeqAcc from 0.087 to 0.250 over the same threshold range. These patterns indicate that many predicted steps are semantically close but not strictly identical to the ground truth, consistent with the BLEU/ROUGE observations in the main text.
(2) Comparison between the proposed model and the baseline.
Across thresholds, the two models show comparable step-level coverage (SA) and precision (SP). The proposed model exhibits slightly higher strict sequence-level accuracy at higher thresholds (e.g., 0.112 vs. 0.087 at τ = 1.00; 0.134 vs. 0.105 at τ = 0.95), while the baseline is slightly higher at lower thresholds (e.g., 0.250 vs. 0.243 at τ = 0.70). This suggests that the proposed approach does not sacrifice procedural correctness, and that remaining errors are largely attributable to minor phrasing variation and/or step granularity differences (splitting/merging), rather than complete procedural divergence.
(3) Step-length statistics.
The average predicted step length is close to the average ground-truth length (proposed: 4.736 vs. 4.873; baseline: 4.583 vs. 4.873), indicating that both models generate a comparable number of steps on average, with the proposed model producing slightly longer sequences. This aligns with the observed sensitivity of strict SeqAcc to small length differences.
We report multiple thresholds to illustrate robustness; τ = 1.00 corresponds to strict exact matching after normalization, while τ ∈ [0.70, 0.90] reflects tolerant matching for semantically similar step paraphrases.
Table A3. Alignment-based step and sequence accuracy under different similarity thresholds (γ = 0.1).
Model | τ | Mean Step Accuracy (SA) | Mean Step Precision (SP) | Mean Step F1 | Mean SeqAcc (Strict) | Avg. Pred. Len | Avg. True Len
Standard Seq2Seq | 1.00 | 0.555 | 0.576 | 0.560 | 0.087 | 4.583 | 4.873
Standard Seq2Seq | 0.95 | 0.569 | 0.590 | 0.574 | 0.105 | 4.583 | 4.873
Standard Seq2Seq | 0.90 | 0.628 | 0.654 | 0.635 | 0.149 | 4.583 | 4.873
Standard Seq2Seq | 0.85 | 0.666 | 0.692 | 0.673 | 0.174 | 4.583 | 4.873
Standard Seq2Seq | 0.80 | 0.695 | 0.724 | 0.703 | 0.207 | 4.583 | 4.873
Standard Seq2Seq | 0.75 | 0.724 | 0.755 | 0.733 | 0.236 | 4.583 | 4.873
Standard Seq2Seq | 0.70 | 0.741 | 0.774 | 0.750 | 0.250 | 4.583 | 4.873
Proposed Multi-Task Model | 1.00 | 0.551 | 0.559 | 0.550 | 0.112 | 4.736 | 4.873
Proposed Multi-Task Model | 0.95 | 0.566 | 0.574 | 0.565 | 0.134 | 4.736 | 4.873
Proposed Multi-Task Model | 0.90 | 0.625 | 0.634 | 0.624 | 0.167 | 4.736 | 4.873
Proposed Multi-Task Model | 0.85 | 0.658 | 0.668 | 0.658 | 0.181 | 4.736 | 4.873
Proposed Multi-Task Model | 0.80 | 0.690 | 0.701 | 0.689 | 0.199 | 4.736 | 4.873
Proposed Multi-Task Model | 0.75 | 0.723 | 0.735 | 0.722 | 0.217 | 4.736 | 4.873
Proposed Multi-Task Model | 0.70 | 0.747 | 0.758 | 0.746 | 0.243 | 4.736 | 4.873

References

  1. Ji, Y.; Zhang, X.; Sun, L. A Review of Traffic Incident Duration Prediction Methods. Highw. Eng. 2008, 33, 72–79, 141. (In Chinese)
  2. Li, R.; Pereira, F.C.; Ben-Akiva, M.E. Overview of traffic incident duration analysis and prediction. Eur. Transp. Res. Rev. 2018, 10, 22.
  3. Wang, L.; Yang, K.; Fang, T.; Li, Y. Analysis of Highway Traffic Accident Severity Based on Historical Data. Traffic Technol. 2025, 14, 642–653. (In Chinese)
  4. Macioszek, E.; Wyderka, A.; Jurdana, I. The bicyclist safety analysis based on road incidents maps. Sci. J. Silesian Univ. Technol. Ser. Transp. 2025, 126, 129–147.
  5. He, Q.; Liu, J.; Li, S.; Cheng, R. Highway Traffic Accident Duration Prediction Based on SO-BiLSTM. J. Chongqing Jiaotong Univ. (Nat. Sci. Ed.) 2024, 43, 97–105. (In Chinese)
  6. Grigorev, A.; Mihaita, A.S.; Chen, F. Traffic incident duration prediction: A systematic review of techniques. J. Adv. Transp. 2024, 2024, 3748345.
  7. Zhu, L.; Zhang, Q.; Jian, X.; Yang, Y. Graph Convolutional Network for Traffic Incidents Duration Classification. Eng. Appl. Artif. Intell. 2025, 151, 110570.
  8. Chiang, H.S.; Liu, Q.Y. Traffic Incident Duration Prediction Based on Deep Learning Methods. Enterp. Inf. Syst. 2024, 19, 2448828.
  9. Jia, X.; Li, S.; Yang, H.; Chen, X. Highway Traffic Incident Duration Prediction Based on ATT-LSTM Model. Traffic Inf. Secur. 2022, 40, 61–69. (In Chinese). Available online: http://www.jtxa.net/cn/article/doi/10.3963/j.jssn.1674-4861.2022.05.007 (accessed on 12 December 2025).
  10. Chang, H.; Chang, T. Prediction of freeway incident duration based on classification tree analysis. J. East Asia Soc. Transp. Stud. 2013, 9, 1964–1977.
  11. Hojati, A.T.; Ferreira, L.; Washington, S.; Charles, P.; Shobeirinejad, A. Modelling Total Duration of Traffic Incidents Including Incident Detection and Recovery Time. Accid. Anal. Prev. 2014, 71, 296–305.
  12. Tirtha, S.D.; Yasmin, S.; Eluru, N. Modeling of Incident Type and Incident Duration Using Data from Multiple Years. Anal. Methods Accid. Res. 2020, 28, 100132.
  13. Gu, Y.; Zhang, H.; Han, L.D.; Khattak, A. Modeling Spatiotemporal Heterogeneity in Interval-Censored Traffic Incident Recovery Time Using Crowdsourced Data. Accid. Anal. Prev. 2024, 195, 107406.
  14. Zhu, W.; Wu, J.; Fu, T.; Wang, J.; Zhang, J.; Shangguan, Q. Dynamic prediction of traffic incident duration on urban expressways: A deep learning approach based on LSTM and MLP. J. Intell. Connect. Veh. 2021, 4, 80–91.
  15. Abdi, A.; Seyedabrishami, S.; O’Hern, S. A Two-Stage Sequential Framework for Traffic Accident Post-Impact Prediction Utilizing Real-Time Traffic, Weather, and Accident Data. J. Adv. Transp. 2023, 2023, 8737185.
  16. Li, D.; Wu, J.; Peng, D. Online Traffic Accident Spatial-Temporal Post-Impact Prediction Model on Highways Based on Spiking Neural Networks. J. Adv. Transp. 2021, 2021, 9290921.
  17. Ulu, M.; Türkan, Y.S.; Mengüç, K.; Namlı, E.; Küçükdeniz, T. Dynamic forecasting of traffic event duration in Istanbul: A classification approach with real-time data integration. Comput. Mater. Contin. 2024, 80, 2259–2281.
  18. Tang, J.; Zheng, L.; Han, C.; Liu, F.; Cai, J. Traffic incident clearance time prediction and influencing factor analysis using extreme gradient boosting model. J. Adv. Transp. 2020, 2020, 6401082.
  19. Chen, J.N.; Tao, W.J.; Zhang, X.; Ma, L. Mediating Effect Analysis on Traffic Incident Discovery Time between Influence Factors and Duration Time on Expressways. In Proceedings of the CICTP 2023: Innovation-Empowered Technology for Sustainable, Intelligent, Decarbonized, and Connected Transportation, Beijing, China, 6–9 July 2023; American Society of Civil Engineers: Reston, VA, USA; pp. 1194–1203.
  20. Karndacharuk, A.; Hassan, A. Traffic incident management: Framework and contemporary practices. In Proceedings of the 39th Australasian Transport Research Forum (ATRF), Auckland, New Zealand, 27–29 November 2017.
  21. Ji, K.; Chen, J.; Xiao, S.; Wang, X.; Liu, Y.; Fu, Z. Highway Accident Duration Prediction Model Driven by Text Data. Traffic Inf. Secur. 2020, 38, 9–16. (In Chinese). Available online: http://www.jtxa.net/cn/article/doi/10.3963/j.jssn.1674-4861.2020.06.002 (accessed on 12 December 2025).
  22. Chen, J.; Tao, W.; Jin, Y.; Wang, P.; Zhang, J. Multimodal Text Information-Based Prediction of Highway Traffic Incident Duration. China Saf. Sci. Technol. 2023, 19, 180–186. (In Chinese). Available online: https://www.zhangqiaokeyan.com/academic-journal-cn_journal-safety-science-technology_thesis/02012101527207.html (accessed on 12 December 2025).
  23. Ji, Y.; Zhang, X.; Sun, L. Traffic Incident Duration Prediction and Parameter Calibration. J. Chongqing Jiaotong Univ. (Nat. Sci. Ed.) 2010, 29, 613–620. (In Chinese). Available online: http://xbzk.cqjtu.edu.cn/CN/Y2010/V29/I4/613 (accessed on 12 December 2025).
  24. Tong, S.; Cong, H.; Chen, Y. Fuzzy Logic Prediction Model for Highway Traffic Incident Clearance Time. J. Chongqing Jiaotong Univ. (Nat. Sci. Ed.) 2011, 30, 5–10. (In Chinese). Available online: https://d.wanfangdata.com.cn/periodical/cqjtxyxb201101021 (accessed on 12 December 2025).
  25. Dogan, Y.; Tuysuzoglu, G.; Kiyak, E.O.; Ghasemkhani, B.; Birant, K.U.; Utku, S.; Birant, D. A Novel Reduced Error Pruning Tree Forest with Time-Based Missing Data Imputation (REPTF-TMDI) for Traffic Flow Prediction. Comput. Model. Eng. Sci. 2025, 144, 1677–1695.
  26. Zhou, X.; Zhou, Q.; Li, H.; Lv, X. Automatic Annotation of Temporal Information in Chinese Clinical Texts. Chin. J. Biomed. Eng. 2012, 31, 6–12.
  27. Zhao, H.; Fu, Z.; Wang, L. Chinese Sentiment Analysis Based on Feature Fusion. J. Lanzhou Univ. Technol. 2022, 48, 94–102. (In Chinese). Available online: https://journal.lut.edu.cn/CN/Y2022/V48/I3/94 (accessed on 12 December 2025).
  28. Lin, T.; Huang, B.; Wu, Y.; Zhu, T. Analysis of Severe Traffic Accident Characteristics and Typical Scenarios Based on Text Data. China-Arab Sci. Technol. Forum 2025, 05, 79–83. (In Chinese). Available online: https://cstj.cqvip.com/Qikan/Article/Detail?id=7200839612 (accessed on 12 December 2025).
  29. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  30. Fan, H.; Qin, J.; Sun, H.; Zhang, L.; Lu, X. Traffic Accident Text Information Extraction Model Based on BERT and BiGRU-CRF. Comput. Mod. 2022, 10–15. (In Chinese). Available online: https://www.nstl.gov.cn/paper_detail.html?id=05553678b13c51a3062c087bf1e0b3ef (accessed on 12 December 2025).
  31. Hu, L.; Yu, X.; Zhao, X.; Yang, Z.; Wang, X.; Hu, F.; Wu, J. Improved BERT Model-Based Method for Traffic Operation Situation Prediction. Highw. Traffic Technol. 2025, 42, 18–25. (In Chinese)
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  33. Gong, D.; Lee, J.; Kim, M.; Ha, S.J.; Cho, M. Future Transformer for Long-Term Action Anticipation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3052–3061.
  34. Xiong, R.; Yang, Y.; He, D.; Zheng, K.; Zheng, S.; Xing, C.; Zhang, H.; Lan, Y.; Wang, L.; Liu, T.-Y. On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML), Online, 13–18 July 2020.
  35. Sheng, J.W.; Guo, S.; Yu, B.W.; Li, Q.; Hei, Y.; Wang, L.; Liu, T.; Xu, H. CasEE: A Joint Learning Framework with Cascade Decoding for Overlapping Event Extraction. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 6–11 August 2021; pp. 164–174.
  36. Qu, Y.; Kim, J. Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation. Sensors 2025, 25, 359.
  37. Gritsai, G.; Voznyuk, A.; Khabutdinov, I.; Grabovoy, A. Advacheck at GenAI Detection Task 1: AI Detection Powered by Domain-Aware Multi-Tasking. In Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect), Abu Dhabi, United Arab Emirates, 18–21 January 2025; pp. 236–243.
  38. Palasundram, K.; Sharef, N.M.; KasMiran, K.A.; Azman, A. SEQ2SEQ++: A Multitasking-Based seq2seq Model to Generate Meaningful and Relevant Answers. IEEE Access 2021, 9, 164949–164975.
  39. Spathis, D.; Kawsar, F. The first step is the hardest: Pitfalls of representing and tokenizing temporal data for large language models. J. Am. Med. Inform. Assoc. 2024, 31, 2151–2158.
  40. Schwartz, E.; Choshen, L.; Shtok, J.; Doveh, S.; Karlinsky, L.; Arbelle, A. NumeroLogic: Number encoding for enhanced LLMs’ numerical reasoning. arXiv 2024, arXiv:2404.00459.
  41. Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688.
  42. Chen, J.; Zhang, J.; Wang, P.; Jin, Y. A k-nearest text similarity-BiGRU approach for duration prediction of traffic accidents on expressways. Discov. Appl. Sci. 2025, 7, 744.
  43. Ruder, S. An overview of multi-task learning in deep neural networks. arXiv 2017, arXiv:1706.05098.
  44. Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; Finn, C. Gradient surgery for multi-task learning. Adv. Neural Inf. Process. Syst. 2020, 33, 5824–5836. Available online: https://proceedings.neurips.cc/paper/2020/hash/3fe78a8acf5fda99de95303940a2420c-Abstract.html (accessed on 12 December 2025).
  45. Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7482–7491.
Figure 1. Distribution of Incident Duration (from incident detection to traffic restoration). The histogram shows the frequency distribution of incident durations, and the solid line represents the kernel density estimation (KDE) curve.
Figure 2. Construction of “Time–Action” Sequence Pairs. Solid and dashed lines indicate different types of links in the workflow. Colors and box styles are used for visual differentiation; elements with the same color denote the same entity throughout the figure.
Figure 3. Architecture of the BERT Model.
Figure 4. Architecture of the Transformer decoder. The dashed box denotes a decoder layer that is repeated multiple times.
Figure 5. Architecture of the Multi-Task Seq2Seq Model for Predicting Response Steps and Time Offsets. Solid arrows indicate the forward data flow during training and inference, whereas dashed arrows indicate gradient backpropagation (used only during training).
Table 1. Sample Records of Highway Traffic Incident Response Texts. The Chinese originals of the sample records are provided in Supplementary Table S1.
Subtype Name | Alert Description | Owner Report
Traffic Accident | At 20:30, on the Pingluo Expressway, in the direction from Pingyuan Street to Suolong Temple (between Asanlong Station and Suolong Temple Station), inside Yingzuiyan Tunnel No.1 at K1173+894, a truck overturned after a single-vehicle crash. A car failed to avoid the overturned vehicle and rear-ended it. Four occupants suffered no injuries. The overtaking lane was occupied, but traffic was not interrupted. | After discovering the crash, traffic police and tunnel management staff were already handling the scene. At 20:32, rescue and road administration departments were notified to head to the site; at 20:40, the management office reported to the Group Emergency Command Center; at 21:46, the rescue team arrived and began cargo removal; at 23:55, the incident was handled and traffic resumed.
Congestion Event | At 17:00, on the Anchang Expressway, in the direction from Chuxiong to An’ning, between K2324+000 and K2322+000 (from Dinosaur Valley to Changtian Station section), heavy traffic caused slow movement and congestion for approximately 2 km. | After congestion was detected via surveillance video, at 17:00, the traffic police were notified to implement traffic control measures; at 17:04, the management office immediately reported to the Group Emergency Command Center; at 20:25, traffic flow returned to normal.
Table 2. Model Training Hyperparameters.
Hyperparameter | Value | Hyperparameter | Value
Pretrained Model | bert-base-chinese | Teacher forcing rate | 0.5
Optimizer | Adam | Max generation length | 20
Learning rate | 1 × 10−4 | Encoder hidden size | 768
Batch size | 8 | Decoder layers | 3
Epochs | 15 | Attention heads | 8
Dropout rate | 0.1 | Gradient clipping | 1.0
Table 3. Multilayer Perceptron Architecture.
Layer | Operation | Input Dim. | Output Dim. | Activation/Regularization
Feature Fusion Layer | Linear | 4 × d_model | d_hidden | ReLU
Hidden Layer 1 | Linear | d_hidden | d_hidden | ReLU + Dropout(0.2)
Hidden Layer 2 | Linear | d_hidden | d_hidden/2 | ReLU + Dropout(0.1)
Output Layer | Linear | d_hidden/2 | 1 | —
Table 4. Comparison of Time Prediction and Response Step Generation Performance.
Model | RMSE | MAE | MAPE | MedAE | SMAPE | BLEU-4 | ROUGE-L
LoRA-tuned LLM | 42.31 | 33.59 | 87.24% | 30.00 | 53.98% | 60.24% | 82.22%
BERT-based Regression | 25.37 | 19.37 | 44.98% | 14.47 | 38.97% | N/A * | N/A *
BERT + GRU | 23.16 | 18.87 | 51.18% | 16.53 | 38.14% | N/A * | N/A *
Standard Seq2Seq | 24.29 | 19.34 | 48.65% | 15.50 | 51.25% | 64.55% | 81.11%
Proposed Multi-Task Model | 18.05 | 14.69 | 37.13% | 13.23 | 33.55% | 62.33% | 82.04%
* The BERT-based regression and BERT + GRU models output only the final duration and therefore do not support text generation; their response step metrics are not applicable (N/A).
Table 5. Case Study Comparison Between Ground Truth and Model-Generated Response Process. The Chinese originals of the case study texts are provided in Supplementary Table S3.
Input Information
Incident Type: Traffic Accident
Alert Description: At 19:35, on the Anchang Expressway in the direction from Chuxiong to Anning, at K2288+000 (between Anfengying Station and Caopu Station), a truck and a car were involved in a rear-end collision, occupying the driving lane in the Chuxiong-to-Anning direction. No injuries were reported and traffic was not interrupted.
Ground Truth Process
1. After discovering the crash (+0 min)
2. Notify traffic police, road administration, and rescue departments to head to the scene (+0 min)
3. The management office immediately reports to the Group Emergency Command Center (+9 min)
4. Traffic police and rescue units arrive at the scene (+35 min)
5. Towing and clearance conducted by closing the driving lane (+36 min)
6. Incident resolved and traffic resumed (+52 min)
Actual Total Duration: 52 min
Model Prediction
1. After discovering the crash (+0.76 min)
2. Notify traffic police, road administration, and rescue departments (+1.34 min)
3. The management office immediately reports to the Group Emergency Command Center (+7.57 min)
4. Traffic police arrive at the scene (+11.70 min)
5. Rescue units arrive at the scene (+15.89 min)
6. Towing operation by closing the overtaking lane (+26.39 min)
7. Incident resolved and traffic resumed (+56.54 min)
Predicted Total Duration: 56.54 min
Table 6. Prediction Performance Across Incident Types.
Incident Type | RMSE | MAE | MAPE | MedAE | SMAPE | BLEU-4 | ROUGE-L
Traffic Accidents | 17.00 | 13.98 | 38.24% | 13.33 | 34.53% | 64.16% | 82.23%
Mechanical Failures | 16.09 | 13.09 | 33.11% | 11.23 | 31.15% | 55.15% | 75.58%
Congestion Events | 24.54 | 20.31 | 34.85% | 18.26 | 30.42% | 87.62% | 86.11%
All Incident Types | 18.05 | 14.69 | 37.13% | 13.23 | 33.55% | 62.33% | 82.04%
