1. Introduction
Scaling legal reasoning to the millions of judgment documents produced annually by Chinese courts requires structured representations of the case facts currently locked in unstructured natural language [1, 2]. Downstream applications such as similar-case retrieval, charge prediction, and sentencing assistance operate on discrete factual elements (e.g., employment duration, disputed amount, custody arrangement), yet the fact-description sections of judgment documents record these elements in free-form prose without explicit annotation. Legal element recognition, the task of tagging each sentence with the factual elements it conveys, is therefore the extraction bottleneck that determines how much of the judicial record becomes accessible to computational legal analysis. It is naturally formalized as sentence-level multi-label classification, where each civil-law domain imposes its own label schema over the factual elements specific to that area of law [3]. Stated as a machine learning problem, each sentence is an instance that may be assigned zero or more labels from a fixed schema of $K$ domain-specific elements, making legal element recognition a canonical instance of multi-label text classification [4].
Two structural properties of legal judgment corpora make this task substantially harder than balanced multi-label benchmarks. In a long-tail distribution, a small set of head classes dominates the training data while the majority of tail classes each appear very infrequently [5]; this imbalance is well studied in visual recognition but is equally acute in specialized legal NLP corpora. The label distribution is severely long-tailed: head-to-tail frequency ratios consistently exceed
, with the rarest labels having as few as 50 positive training instances [3]. Simultaneously, more than 60% of sentences carry no label at all [3], so that under binary cross-entropy (BCE) training every label dimension receives overwhelmingly negative gradients at each update. These two forces reinforce each other: the negative-gradient mass drives predictions toward all-zero outputs, while the already-sparse positive gradients of tail labels are further diluted, leaving the model with almost no informative signal near the decision boundaries of rare elements. The resulting symptom is a wide gap between Micro F1 and Macro F1 in which the rarest elements account for the largest per-label errors.
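The Micro/Macro gap described above can be illustrated with a small sketch. The confusion counts below are hypothetical and chosen only to mimic one well-predicted head label and two mostly missed tail labels; they are not taken from CAIL2019.

```python
def f1(tp, fp, fn):
    """F1 from confusion counts; defined as 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one tuple per label."""
    # Macro F1 averages per-label F1, so every label counts equally.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro F1 pools the counts first, so frequent labels dominate.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn), macro


# Hypothetical counts: (tp, fp, fn) for one head label and two tail labels.
counts = [(900, 50, 50), (5, 1, 45), (2, 0, 48)]
micro, macro = micro_macro_f1(counts)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.903 macro=0.401
```

Because the head label contributes most of the pooled counts, Micro F1 stays high even though two of the three labels are almost never recovered.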
Loss reweighting is the most direct response to this imbalance. Class-Balanced Loss [6] scales each label by its inverse effective sample count; Asymmetric Loss [7] further suppresses easy-negative contributions through a shifted focusing term. These methods improve Macro F1 in moderately imbalanced regimes, but their weights are computed once from corpus-level frequencies and remain fixed throughout training, so they cannot shift toward the decision-boundary samples that become increasingly informative as the model improves. Other non-adaptive strategies share a related limitation: oversampling [8] increases tail-label exposure but, in multi-label sentences, also amplifies the gradients of the co-occurring head labels; contrastive representation learning [9] improves embedding separability but does not redistribute the gradient magnitude across co-occurring labels during loss computation; post-hoc confidence calibration [10] adjusts prediction scores after training but cannot recover the representational quality lost to the gradient imbalance accumulated during optimization. What all these approaches lack is a mechanism that adapts during training and operates at label-wise granularity within multi-label instances.
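For reference, the two static losses named above can be sketched as follows. This is a minimal illustration of the published formulas, not the configuration used in the experiments; the values of `beta`, `gamma_pos`, `gamma_neg`, and `margin` are illustrative assumptions.

```python
import math


def class_balanced_weight(n_pos, beta=0.999):
    """Inverse effective number of samples: (1 - beta) / (1 - beta ** n_pos)."""
    return (1.0 - beta) / (1.0 - beta ** n_pos)


def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric Loss for one label with predicted probability p and target y."""
    eps = 1e-8
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Probability shifting: negatives with p <= margin contribute zero loss.
    p_m = max(p - margin, 0.0)
    return -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))


# A label with 50 positives receives a larger weight than one with 5000.
w_tail = class_balanced_weight(50)
w_head = class_balanced_weight(5000)
```

In this sketch the class-balanced weight depends only on the label count, which is the sense in which such weights stay fixed throughout training.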
Meta-learning reweighting provides the missing adaptivity by deriving instance weights from a held-out meta-set at each training step. Ren et al. [11] assign a scalar weight to each training sample and update it by differentiating through one gradient step on the meta-set loss; Meta-Weight-Net [12] extends this idea by parameterizing the mapping as a learned network. Both approaches have advanced single-label long-tail recognition, yet a per-sample scalar weight amplifies all label gradients in that sample uniformly: when a head label and a tail label co-occur in one sentence, the tail label’s gradient cannot be independently adjusted. A second limitation concerns the meta-set itself: random or class-balanced sampling inherits the training distribution’s imbalance, producing a gradient signal dominated by easy negatives and head labels rather than by the tail-label boundary instances where the calibration signal is most needed. Effective meta-reweighting in this setting, therefore, requires two properties that no existing method provides jointly: label-wise weight granularity for intra-sentence gradient decoupling, and a meta-set concentrated on the tail-label decision boundaries for stable hypergradient direction.
We propose Boundary-aware Meta-Learning Transformer (BML-Trans) to satisfy both requirements within a single training framework. The “meta-learning” in BML-Trans refers specifically to label-wise meta-weighting (the title component “Label-Wise Meta-Weighting”): per-sample weights are replaced with a per-label weight vector that is updated via bilevel hypergradient descent against a curated meta-set of boundary-selected sentences, enabling the optimizer to amplify a tail label’s gradient independently of co-occurring head labels. The meta-set is constructed by selecting instances whose predictions lie near the decision boundary subject to a tail-label coverage constraint, ensuring that the gradient signal from the meta-set reflects tail-class difficulty rather than easy-negative dominance. The Multi-Scale Adapter, trained jointly during warm-up, improves early-stage probability estimates by enriching encoder representations at multiple convolutional scales, raising the quality of boundary-based selection in subsequent epochs and stabilizing meta-weight updates.
On all three CAIL2019 domains, BML-Trans improves tail-label Macro F1 by up to 5.7 percentage points over ASL, the strongest static-reweighting baseline, and by up to 3.6 percentage points over Ren et al., the strongest sample-level meta-learning baseline, while maintaining competitive Micro F1 at only 14% additional training cost. Ablation experiments reveal a cascade dependency among the three components: removing the Multi-Scale Adapter weakens the warm-up probability estimates and thereby degrades meta-set quality, which in turn reduces the effectiveness of label-wise meta-weighting. Our main contributions are as follows:
A label-wise meta-weighting mechanism, updated via a one-step bilevel gradient descent, that independently scales each label’s gradient contribution within a shared training sentence, dissolving the per-sample coupling that prevents existing meta-reweighting methods from amplifying tail-label gradients without simultaneously inflating those of co-occurring head labels.
A boundary-aware meta-set construction procedure that selects calibration sentences by jointly scoring model uncertainty, prediction loss, and tail-trigger density, subject to a per-tail-label coverage constraint; the Multi-Scale Adapter trained during warm-up sharpens early probability estimates, which in turn raises meta-set quality and stabilizes hypergradient direction, a cascade dependency that the ablation study quantifies across all three domains.
Disaggregated evaluation across Head, Mid, and Tail frequency buckets on all three CAIL2019 domains, showing that BML-Trans improves Tail Macro-F1 by up to 5.7 percentage points over the strongest static-reweighting baseline and by up to 3.6 percentage points over sample-level meta-reweighting, at only 14% additional training cost, with mechanistic analyses confirming that the gains originate from the bilevel optimization rather than threshold selection or favorable seeds.
The rest of this paper is organized as follows.
Section 2 reviews legal element recognition, imbalance handling in multi-label classification, meta-learning reweighting, and multi-scale representations.
Section 3 formulates the task, describes the Multi-Scale Adapter and label-wise meta-weighting, and presents the boundary meta-set construction procedure.
Section 4 reports setup, main results, long-tail and ablation analyses, and mechanistic findings.
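Section 5 discusses failure modes, computational cost, meta-set dynamics, and limitations.
Section 6 summarizes the contributions and outlines future extensions.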
3. Method
3.1. Overview
BML-Trans resolves the two failure modes identified in Section 1 through a two-phase training procedure (Figure 1). Gradient coupling across co-occurring labels is dissolved by a label-wise meta-weighting mechanism that maintains a per-label weight vector $\mathbf{w} = (w_1, \dots, w_K)$ updated via bilevel hypergradient descent. Head-label dominance of the calibration signal is corrected by a boundary-calibrated meta-set $\mathcal{D}_{\text{meta}}$, assembled by jointly scoring prediction loss, model uncertainty, and tail-trigger density after warm-up. A lightweight Multi-Scale Adapter supports both by enriching MacBERT-base representations at three convolutional scales, sharpening the probability estimates on which boundary scoring depends.
In Phase 1 (Warm-up), the backbone and adapter are trained with Asymmetric Loss to push predictions away from the all-zero attractor and to produce probability estimates reliable enough for boundary scoring; $\mathcal{D}_{\text{meta}}$ is then assembled from the $N_{\text{meta}}$ top-scoring training sentences subject to a per-tail-label coverage constraint. In Phase 2 (Main Training), the weighted loss $\mathcal{L}_{\text{train}}$ is optimized jointly with periodic bilevel updates to $\mathbf{w}$: because $w_k$ multiplies only label $k$’s loss, tail-label gradients can be amplified independently of co-occurring head labels, reallocating the gradient budget toward rare elements without coupling their updates through a shared scalar.
Sections 3.3, 3.4, and 3.5 formalize each component.
3.2. Problem Formulation
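Let $\mathcal{X}$ denote the space of Chinese sentence strings and $\mathcal{Y} = \{1, \dots, K\}$ be the set of legal element labels for a given domain, where $K = 20$ in our setting. The goal is to learn a function $f_\theta : \mathcal{X} \to \mathbb{R}^{K}$ that maps a sentence $x$ to a $K$-dimensional score vector. The predicted label vector is defined as
$$\hat{\mathbf{y}} = \mathbb{1}\left[\sigma\left(f_\theta(x)\right) \ge \tau\right],$$
where $\sigma$ denotes the sigmoid function, applied element-wise, and $\tau$ is the decision threshold. The ground-truth label vector is $\mathbf{y} \in \{0, 1\}^{K}$.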
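The prediction rule can be sketched in a few lines. The threshold value `tau=0.5` is an assumption made for illustration, and the score vectors are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_labels(scores, tau=0.5):
    """Map a K-dimensional score vector to a binary label vector."""
    return [1 if sigmoid(s) >= tau else 0 for s in scores]


# A sentence may receive several labels or none at all.
multi = predict_labels([2.0, -1.0, 0.3])    # [1, 0, 1]
empty = predict_labels([-3.0, -2.0, -0.1])  # [0, 0, 0]
```

The second call shows the all-zero output that more than 60% of sentences are expected to receive.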
Each legal domain is treated independently, and a separate model is trained for each of the three domains (Labor, Divorce, Loan) in the CAIL2019 dataset.
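Let the labeled training set be $\mathcal{D}_{\text{train}}$, and the held-out test set be $\mathcal{D}_{\text{test}}$. We further construct a small boundary meta-set $\mathcal{D}_{\text{meta}}$ with $|\mathcal{D}_{\text{meta}}| \ll |\mathcal{D}_{\text{train}}|$.
The model parameters are denoted as $\theta$. In addition, BML-Trans maintains a label weight vector $\mathbf{w} = (w_1, \dots, w_K)$, parameterized by an unconstrained vector $\mathbf{u} \in \mathbb{R}^{K}$ as
$$w_k = \mathrm{softplus}(u_k) = \log\left(1 + e^{u_k}\right), \qquad k = 1, \dots, K.$$
The learning rate for updating $\theta$ is $\eta$, the meta-learning rate for $\mathbf{u}$ is $\eta_{\text{meta}}$, and a meta-update is performed every $M$ gradient steps.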
3.3. Multi-Scale Adapter
BML-Trans uses MacBERT-base [
16] as its backbone encoder. Legal element triggers span a wide range of textual scopes: a single-token statutory term can uniquely signal one element, whereas other elements are characterized by multi-token phrases (named legal concepts or numerical conditions embedded in longer clauses) that require a wider receptive field to be reliably detected. Standard Transformer self-attention aggregates global context effectively but does not explicitly capture local n-gram patterns across fixed windows of varying size. To bridge this gap, we append a lightweight Multi-Scale Adapter after the final encoder layer that enriches token representations through three parallel depth-wise separable convolution branches.
Given a tokenized sentence of at most $n$ tokens, MacBERT produces contextual hidden states $H \in \mathbb{R}^{n \times d}$ with $d = 768$. Each branch applies a depth-wise separable convolution with a distinct kernel size to $H$,
$$C^{(s)} = \mathrm{DSConv}_{s}(H), \qquad s \in \{3, 5, 7\},$$
so that the three branches capture local co-occurrence patterns across spans of three, five, and seven subword tokens, respectively. The depth-wise separable factorization [40], comprising a per-channel spatial filter followed by a pointwise projection, substantially reduces parameter count relative to standard convolutions while preserving receptive-field coverage, making it well-suited for a lightweight adapter module atop a large pre-trained encoder.
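The parameter saving of the factorization can be checked with simple arithmetic. The sketch below counts weights for a single full-width branch with $d$ input and $d$ output channels and no biases; the adapter's actual channel widths are not stated here, so these figures are illustrative and do not reproduce the reported adapter total.

```python
def standard_conv1d_params(d, k):
    """Weights of a standard 1-D convolution with d input and d output channels."""
    return d * d * k


def depthwise_separable_params(d, k):
    """Per-channel spatial filter (d * k) plus pointwise projection (d * d)."""
    return d * k + d * d


d = 768
for k in (3, 5, 7):
    std = standard_conv1d_params(d, k)
    sep = depthwise_separable_params(d, k)
    print(f"k={k}: standard={std:,} separable={sep:,} ratio={std / sep:.1f}x")
```

The saving grows with kernel size because the spatial filter no longer multiplies the $d \times d$ channel-mixing term.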
The three branch outputs are concatenated along the feature dimension and projected back to dimension $d$ by a learned matrix $W_p \in \mathbb{R}^{3d \times d}$:
$$A = \left[C^{(3)};\, C^{(5)};\, C^{(7)}\right] W_p,$$
where $C^{(s)}$ is the output of the branch with kernel size $s$. Rather than adding $A$ unconditionally to the encoder output, we modulate its contribution with a gating scalar $g$ derived from the mean-pooled hidden state, which allows the adapter to suppress its own output when the pre-trained representations already carry sufficient local detail, preventing the adapter from interfering with well-learned encoder representations. The final adapted sequence is obtained by a gated residual addition followed by layer normalization:
$$\tilde{H} = \mathrm{LayerNorm}\left(H + g \cdot A\right).$$
The sentence-level embedding $\mathbf{e} \in \mathbb{R}^{d}$ is read out from $\tilde{H}$, and the $K$-way classification logits are $\mathbf{z} = W_c \mathbf{e} + \mathbf{b}_c$ with $W_c \in \mathbb{R}^{K \times d}$. In total, the Multi-Scale Adapter introduces fewer than 2.4 million additional parameters for $d = 768$, below 2% of MacBERT-base, ensuring that its contribution can be cleanly isolated in ablation studies and that GPU memory overhead remains negligible.
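The gated residual step can be sketched on toy vectors. The gate form used below, a sigmoid applied to a linear read-out of the mean-pooled state with hypothetical parameters `w_g` and `b_g`, is an assumption: the text only states that the gate is a scalar derived from the mean-pooled hidden state. The layer normalization here omits learned scale and shift.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_pool(H):
    """Average the token vectors of H (a list of n vectors of length d)."""
    n = len(H)
    return [sum(row[j] for row in H) / n for j in range(len(H[0]))]


def gate_scalar(H, w_g, b_g):
    """Assumed gate: sigmoid of a linear read-out of the mean-pooled state."""
    pooled = mean_pool(H)
    return sigmoid(sum(w * h for w, h in zip(w_g, pooled)) + b_g)


def layer_norm(v, eps=1e-5):
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]


def gated_residual(H, A, g):
    """LayerNorm(H + g * A), applied token by token."""
    return [layer_norm([h + g * a for h, a in zip(h_row, a_row)])
            for h_row, a_row in zip(H, A)]


H = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]    # toy encoder states, n = 2, d = 3
A = [[0.5, -0.5, 0.0], [1.0, 0.0, -1.0]]  # toy adapter output, same shape
g = gate_scalar(H, w_g=[0.1, -0.2, 0.3], b_g=0.0)
adapted = gated_residual(H, A, g)
```

With `g = 0` the output reduces to the layer-normalized encoder states, which is the sense in which the adapter can suppress its own contribution.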
3.4. Label-Wise Meta-Weighting
When a training sentence carries both a high-frequency and a rare label simultaneously, a sample-level scalar weight amplifies or attenuates all label gradients in that sentence uniformly. Increasing the weight to help the rare label unavoidably inflates the already-dominant gradient signal of the head label as well, leaving no mechanism for independent adjustment. Label-wise meta-weighting dissolves this coupling by maintaining a separate scalar $w_k$ for each label $k \in \{1, \dots, K\}$, so that the optimizer can independently scale the gradient contribution of a tail label without proportionally inflating those of co-occurring head labels.
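Let $\ell_k(\theta; \mathcal{B})$ denote the average binary cross-entropy loss for label $k$ over mini-batch $\mathcal{B}$. The weighted training objective accumulates per-label losses as
$$\mathcal{L}_{\text{train}}(\theta; \mathbf{w}) = \sum_{k=1}^{K} w_k \, \ell_k(\theta; \mathcal{B}),$$
where $\mathbf{w} = (w_1, \dots, w_K)$ is the current label weight vector. Because $w_k$ multiplies only the loss contributions of label $k$, the gradient of $\mathcal{L}_{\text{train}}$ with respect to the classification head $W_c$ decomposes cleanly across label dimensions: scaling $w_k$ scales the gradient for label $k$ alone without affecting any other label dimension.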
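The decoupling can be made concrete for a single sentence. For a sigmoid output with BCE loss, the gradient of the weighted loss with respect to logit $z_k$ is $w_k (p_k - y_k)$; the probabilities and weights below are hypothetical.

```python
import math


def bce(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def weighted_loss(probs, targets, label_weights):
    """Sum over labels of w_k times the BCE loss of label k."""
    return sum(w * bce(p, y) for p, y, w in zip(probs, targets, label_weights))


def logit_gradients(probs, targets, label_weights):
    """d/dz_k of w_k * BCE(sigmoid(z_k), y_k) equals w_k * (p_k - y_k)."""
    return [w * (p - y) for p, y, w in zip(probs, targets, label_weights)]


# One sentence that is positive for a head label (index 0) and a tail label (index 1).
probs, targets = [0.7, 0.2], [1, 1]
base = logit_gradients(probs, targets, [1.0, 1.0])
sample_scaled = logit_gradients(probs, targets, [3.0, 3.0])  # per-sample scalar of 3
label_scaled = logit_gradients(probs, targets, [1.0, 3.0])   # label-wise weights
```

A per-sample scalar triples both gradients, whereas the label-wise weights triple only the tail-label gradient and leave the head-label gradient at its base value.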
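The label weights are not fixed before training; they are updated at intervals of $M$ gradient steps by differentiating through a bilevel program. The calibration signal for this update comes from the boundary meta-set $\mathcal{D}_{\text{meta}}$, evaluated without label weighting:
$$\mathcal{L}_{\text{meta}}(\theta) = \sum_{k=1}^{K} \ell_k(\theta; \mathcal{D}_{\text{meta}}), \tag{9}$$
where $\ell_k(\theta; \cdot)$ is the average binary cross-entropy loss for label $k$. Omitting label weights from Equation (9) is deliberate: the meta loss measures raw model performance on the boundary sentences unconfounded by the current weight values, providing a direction whose sign reflects what the model genuinely needs rather than what it has already been asked to optimize.
At each meta-update step, we first compute a virtual one-step update of $\theta$ using the current training batch $\mathcal{B}$,
$$\theta'(\mathbf{w}) = \theta - \eta \, \nabla_{\theta} \mathcal{L}_{\text{train}}(\theta; \mathbf{w}), \tag{10}$$
where $\mathcal{L}_{\text{train}}(\theta; \mathbf{w})$ is the label-weighted training loss on $\mathcal{B}$ and $\eta$ is the main learning rate, retaining the computation graph so that second-order derivatives are available. We then evaluate the meta loss at these hypothetical parameters and backpropagate through $\theta'(\mathbf{w})$ and through the softplus reparameterization to obtain the hypergradient update on the unconstrained weight vector $\mathbf{u}$:
$$\mathbf{u} \leftarrow \mathbf{u} - \eta_{\text{meta}} \, \nabla_{\mathbf{u}} \mathcal{L}_{\text{meta}}\left(\theta'(\mathbf{w})\right), \qquad \mathbf{w} = \mathrm{softplus}(\mathbf{u}). \tag{11}$$
Intuitively, Equations (10) and (11) ask: given the current label weights, would one gradient step with those weights have improved performance on the meta-set? The hypergradient captures this counterfactual sensitivity and updates the weights to reduce the meta-set loss. Following [11], the one-step unrolling approximation avoids differentiating through the full training trajectory while maintaining a sufficiently accurate gradient direction for practical convergence.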
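The one-step hypergradient can be written out by hand for a toy model. The sketch below assumes a linear-sigmoid classifier with a separate parameter vector per label and no shared encoder, so cross-label terms vanish; in that case $\partial \mathcal{L}_{\text{meta}} / \partial w_k = -\eta \, \langle \nabla \ell_k^{\text{meta}}(\theta'), \nabla \ell_k^{\text{train}}(\theta) \rangle$. All data and step sizes are hypothetical, and a real implementation would rely on automatic differentiation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softplus(u):
    return math.log1p(math.exp(u))


def label_loss_and_grad(theta_k, batch, k):
    """Average BCE of label k on `batch` and its gradient w.r.t. theta_k."""
    loss, grad = 0.0, [0.0] * len(theta_k)
    for x, y in batch:
        p = sigmoid(sum(t * xi for t, xi in zip(theta_k, x)))
        loss -= y[k] * math.log(p) + (1 - y[k]) * math.log(1.0 - p)
        for j, xj in enumerate(x):
            grad[j] += (p - y[k]) * xj
    n = len(batch)
    return loss / n, [g / n for g in grad]


def virtual_step(theta, w, train_batch, eta):
    """theta' = theta - eta * grad(sum_k w_k * l_k); also returns per-label grads."""
    grads = [label_loss_and_grad(theta[k], train_batch, k)[1] for k in range(len(theta))]
    theta_new = [[t - eta * w[k] * g for t, g in zip(theta[k], grads[k])]
                 for k in range(len(theta))]
    return theta_new, grads


def meta_loss(theta, meta_batch):
    """Unweighted sum of per-label BCE losses on the meta batch."""
    return sum(label_loss_and_grad(theta[k], meta_batch, k)[0] for k in range(len(theta)))


def hypergradient_step(theta, u, train_batch, meta_batch, eta, eta_meta):
    """One update of the unconstrained weights u via one-step unrolling."""
    w = [softplus(uk) for uk in u]
    theta_virtual, train_grads = virtual_step(theta, w, train_batch, eta)
    new_u = []
    for k in range(len(theta)):
        meta_grad_k = label_loss_and_grad(theta_virtual[k], meta_batch, k)[1]
        # Chain rule through theta'(w): d theta'_k / d w_k = -eta * train_grads[k].
        dL_dw = -eta * sum(a * b for a, b in zip(meta_grad_k, train_grads[k]))
        dL_du = dL_dw * sigmoid(u[k])  # derivative of softplus is sigmoid
        new_u.append(u[k] - eta_meta * dL_du)
    return new_u


# Toy data: each item is (feature vector, label vector) with K = 2 labels.
theta = [[0.0, 0.0], [0.0, 0.0]]
u = [0.5, 0.5]
train_batch = [([1.0, 0.0], [1, 0]), ([0.0, 1.0], [1, 1]), ([1.0, 1.0], [1, 0])]
meta_batch = [([0.0, 1.0], [0, 1]), ([1.0, 1.0], [1, 1])]
new_u = hypergradient_step(theta, u, train_batch, meta_batch, eta=0.1, eta_meta=1.0)
```

A label whose training gradient aligns with its meta-set gradient has a negative hypergradient, so its weight increases; a label whose training gradient opposes the meta-set gradient is down-weighted.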
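After each meta-update, the weight vector is reconstructed from $\mathbf{u}$, mean-normalized to preserve the total gradient budget, and clipped to a stable range:
$$\tilde{w}_k = \mathrm{softplus}(u_k), \qquad w_k = \mathrm{clip}\!\left(\frac{\tilde{w}_k}{\frac{1}{K}\sum_{j=1}^{K} \tilde{w}_j},\; w_{\min},\; w_{\max}\right), \tag{12}$$
where $\mathrm{softplus}(u) = \log(1 + e^{u})$, and $w_{\min}$ and $w_{\max}$ are the lower and upper clipping bounds. Mean-normalization ensures that the weights redistribute gradient mass across labels rather than inflating the total magnitude, while clipping guards against divergence on extremely rare labels whose high tail-trigger density would otherwise drive $w_k$ to arbitrarily large values over successive updates.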
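The reconstruction step can be sketched as follows. The clipping bounds `w_min` and `w_max` used here are placeholder values chosen for illustration, not the values used in the experiments.

```python
import math


def softplus(u):
    return math.log1p(math.exp(u))


def reconstruct_weights(u, w_min=0.1, w_max=10.0):
    """softplus -> divide by the mean -> clip to [w_min, w_max]."""
    raw = [softplus(x) for x in u]
    mean = sum(raw) / len(raw)
    # Before clipping, the normalized weights average to 1 by construction.
    return [min(max(r / mean, w_min), w_max) for r in raw]


uniform = reconstruct_weights([0.0, 0.0, 0.0])
skewed = reconstruct_weights([0.0, 0.0, 5.0], w_min=0.5, w_max=2.0)
```

In the second call the largest normalized weight (about 2.35) is clipped to the upper bound and the two smallest (about 0.33) are raised to the lower bound, so the mean is exactly 1 only when no clipping occurs.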
3.5. Boundary Meta-Set Construction
A randomly sampled meta-set mirrors the overall training distribution, which in CAIL2019 is dominated by fully negative sentences and head-label examples. Gradients computed against such a distribution push label weights toward values that prioritize head-label loss minimization, providing almost no calibration signal for the tail-label decision boundaries where improvement is most needed. The boundary meta-set construction corrects this bias by concentrating the calibration support on sentences that are simultaneously high-loss, near the decision boundary, and positive for at least one tail label. The construction proceeds in three stages.
In the first stage, the model is trained on $\mathcal{D}_{\text{train}}$ with Asymmetric Loss for $E_{\text{warm}}$ epochs to obtain an initial estimate $\theta_0$. These warm-up epochs serve a dual purpose: they push predictions away from the all-zero attractor that BCE training gravitates toward under severe class imbalance, and they produce probability estimates reliable enough for the boundary scoring in the next stage. Because the Multi-Scale Adapter is trained jointly during warm-up, the richer local representations it produces directly improve the accuracy of the boundary scores and, by extension, the meta-set composition.
In the second stage, a composite boundary score is computed for each training sentence $i$:
$$s_i = \frac{1}{3}\left(\tilde{s}_i^{\text{loss}} + \tilde{s}_i^{\text{unc}} + \tilde{s}_i^{\text{tail}}\right), \tag{13}$$
where each component is $z$-score normalized (denoted by the tilde) before combination to equalize their scales. The loss score $s_i^{\text{loss}}$ is the unweighted BCE loss on sentence $i$ under $\theta_0$, measuring how difficult the sentence remains after warm-up. The uncertainty score $s_i^{\text{unc}}$ attains its maximum when sigmoid outputs are near $0.5$, identifying sentences where the model is closest to its decision boundary, and therefore most sensitive to label-weight perturbations. The tail-trigger score $s_i^{\text{tail}}$ accumulates inverse-frequency-weighted positive annotations restricted to the tail-label index set $\mathcal{T}$, giving priority to sentences that carry rare labels and where the hypergradient is most consequential for tail-label performance. Together, the three scores identify sentences that are hard for the current model, near the decision boundary, and directly relevant to the tail labels whose weight updates matter most.
The three component scores are each $z$-score normalized before being averaged in Equation (13). This normalization is the mechanism that equalizes their magnitudes: regardless of the absolute value scales of the loss score $s_i^{\text{loss}}$, the uncertainty score $s_i^{\text{unc}}$, and the tail-trigger score $s_i^{\text{tail}}$, the normalized components have unit variance, so the equal-weight average gives each component a genuine one-third share of influence rather than an implicit weighting toward whichever component happens to have the largest raw scale. The three components are designed to capture complementary rather than redundant information: $s_i^{\text{loss}}$ identifies global model difficulty, $s_i^{\text{unc}}$ identifies decision-boundary proximity, and $s_i^{\text{tail}}$ identifies tail-label relevance; no single criterion enforces all three constraints simultaneously. The equal weighting of one third per component is therefore a parsimonious default that avoids introducing additional hyperparameters for a three-way combination with complementary individual contributions.
In the third stage, the top $N_{\text{meta}}$ sentences ranked by the boundary score $s_i$ form the candidate meta-set. A coverage constraint is then applied: if any tail label $k \in \mathcal{T}$ has fewer than $m$ positive examples in the selected set, additional positive examples for that label are drawn from $\mathcal{D}_{\text{train}}$ in descending order of $s_i$ until the constraint is satisfied. This constraint prevents the hypergradient from being blind to the rarest labels: a tail label absent from $\mathcal{D}_{\text{meta}}$ would never contribute to the hypergradient and hence never receive an elevated weight, even if it were the label most in need of calibration. The complete two-phase training procedure is summarized in Algorithm 1.
Algorithm 1: BML-Trans Training. [Algorithm listing: image not reproduced.]
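The scoring and selection stages of Section 3.5 can be sketched as follows. The exact functional forms of the uncertainty and tail-trigger scores are not recoverable from the text, so the forms below (mean of $1 - |2p - 1|$, and a sum of inverse positive counts over tail labels) are assumptions that only satisfy the stated properties; all function names and toy data are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscore(values):
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def bce(probs, targets, eps=1e-8):
    """Unweighted BCE of one sentence, averaged over labels."""
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for p, y in zip(probs, targets)) / len(probs)


def uncertainty(probs):
    """Assumed form: mean of 1 - |2p - 1|, maximal when every p is 0.5."""
    return sum(1.0 - abs(2.0 * p - 1.0) for p in probs) / len(probs)


def tail_trigger(targets, tail_labels, label_counts):
    """Assumed form: inverse-count-weighted positives over the tail labels."""
    return sum(targets[k] / label_counts[k] for k in tail_labels)


def boundary_scores(probs, targets, tail_labels, label_counts):
    """Equal-weight average of the three z-scored components."""
    s_loss = zscore([bce(p, y) for p, y in zip(probs, targets)])
    s_unc = zscore([uncertainty(p) for p in probs])
    s_tail = zscore([tail_trigger(y, tail_labels, label_counts) for y in targets])
    return [(a + b + c) / 3.0 for a, b, c in zip(s_loss, s_unc, s_tail)]


def build_meta_set(scores, targets, tail_labels, n_meta, m=3):
    """Top-n_meta by score, then top up each tail label to m positives."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = order[:n_meta]
    chosen = set(selected)
    for k in tail_labels:
        have = sum(targets[i][k] for i in selected)
        for i in order:  # candidates in descending score order
            if have >= m:
                break
            if i not in chosen and targets[i][k] == 1:
                selected.append(i)
                chosen.add(i)
                have += 1
    return selected


# Toy warm-up outputs for four sentences and K = 2 labels (label 1 is the tail).
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.5], [0.1, 0.1]]
targets = [[1, 0], [1, 0], [0, 1], [0, 0]]
scores = boundary_scores(probs, targets, tail_labels=[1], label_counts=[2, 1])
meta_indices = build_meta_set(scores, targets, tail_labels=[1], n_meta=2, m=1)
```

If a tail label has fewer than `m` positives in the whole training set, the loop simply adds all of them, which corresponds to the infeasible-constraint case discussed in Section 5.4.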
3.6. Error Tolerance of the Three-Phase Procedure
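Section 3.1 groups training into two phases; for the error analysis it is convenient to count boundary meta-set construction separately, so that the BML-Trans training procedure comprises three sequential phases: warm-up training to obtain $\theta_0$, boundary meta-set construction based on $\theta_0$, and main bilevel training guided by the resulting $\mathcal{D}_{\text{meta}}$. A natural concern is whether approximation errors in earlier phases accumulate and degrade the final model.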
The Phase 1 to Phase 2 transition introduces warm-up approximation error. The boundary scores in Equation (13) depend on the quality of $\theta_0$, which improves as the warm-up becomes more accurate. Crucially, however, the meta-set construction is not sensitive to the precise ranking of every training sentence; it requires only that the top-$N_{\text{meta}}$ boundary sentences be broadly identified as harder, more uncertain, and more tail-relevant than the rest. Even after one or two warm-up epochs (at which point the model has already escaped the all-zero attractor and produces non-trivial predictions), this coarse ordering is sufficiently reliable for meta-set construction. The tail-label coverage constraint provides an additional safeguard: by requiring at least $m$ positive examples per tail label, it ensures that every tail label is represented in $\mathcal{D}_{\text{meta}}$ regardless of warm-up quality, preventing the degenerate case in which a slightly under-trained warm-up model fails to rank any examples of an extremely rare label highly.
The Phase 2 to Phase 3 transition relies on the one-step unrolling approximation in the hypergradient Equation (11) rather than differentiating through the full training trajectory. This approximation has been analyzed theoretically in the context of bilevel optimization [36]: the approximation error is bounded by a term that shrinks with the main learning rate $\eta$ and therefore diminishes as $\eta$ decreases over training. In practice, the weight updates remain stable because the per-label weights are mean-normalized and clipped (Equation (12)) at every meta-update step, bounding the magnitude of any single erroneous hypergradient step. Errors from Phase 1 therefore affect only the initialization of the hypergradient direction, not its asymptotic behavior: the bilevel loop corrects stale meta-set calibration implicitly as $\theta$ evolves and the meta-set loss guides $\mathbf{w}$ toward better values.
3.7. Complexity Analysis
The dominant cost of the encoder forward pass is the full self-attention in MacBERT,
per layer; the Multi-Scale Adapter adds only
, which is negligible since
for typical sentence lengths. Each meta-update requires one additional forward pass over
with graph retention, one forward pass over
, and one backward pass through the computation graph, approximately three times the cost of a standard training step. With meta-update interval
and a reduced meta batch size of
, the amortized wall-clock overhead is empirically approximately 14%;
Section 4.6 reports the systematic evaluation across
M values. The boundary meta-set construction (warm-up plus scoring) is a one-time pre-processing step that completes in fewer than 10 min on the hardware described in
Section 4.1.3 and is not included in the reported training time comparisons.
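The amortization argument reduces to a one-line calculation. The sketch below uses the stated cost of roughly three standard steps per meta-update; the interval of 30 is a hypothetical value chosen only to show the arithmetic, and the function ignores the further saving from a reduced meta batch, so it does not reproduce the empirical 14% figure.

```python
def amortized_overhead(meta_update_cost, interval):
    """Extra cost per training step, in units of one standard training step."""
    return meta_update_cost / interval


# A meta-update costing about 3 standard steps, performed every 30 steps (hypothetical).
overhead = amortized_overhead(meta_update_cost=3.0, interval=30)
print(f"{overhead:.0%}")  # 10%
```

Halving the interval doubles the overhead, which is the trade-off behind the choice of the meta-update interval.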
5. Discussion
5.1. Failure Mode Analysis and Behavior on Extremely Rare Labels
Examination of the remaining errors reveals two systematic failure modes. The first is tail-label misclassification under lexical ambiguity: when a sentence contains surface tokens that are genuinely ambiguous across two semantically adjacent labels (for example, a monetary amount that could trigger either the salary arrears or the economic compensation label in the Labor domain), BML-Trans elevates both label weights and occasionally assigns a false positive to the less relevant label, accounting for a disproportionate share of precision loss on mid-frequency labels that overlap semantically with more common neighbors.
The second failure mode is particularly relevant to the behavior of BML-Trans on extremely rare labels (those in the lowest sub-bucket of the Tail partition, with fewer than approximately 15 positive training instances in the Labor domain; labels DV17 and DV18 in the Divorce domain have as few as 105 and 144 positive instances, respectively). For such labels, the coverage constraint ($m = 3$) in the boundary meta-set construction is often satisfied by the absolute minimum of three positive examples. Even with elevated gradient weights, the classifier lacks sufficient contrastive signal to reliably distinguish these labels from the null prediction: the three meta-set examples provide direction for the hypergradient, but insufficient variance to regularize the weight update reliably. The result is a high false-negative rate for these extreme-tail labels, which accounts for the performance gap between BML-Trans and a hypothetical oracle that had access to more positive instances. Reducing the required minimum $m$ below 3 would relax the constraint but weaken calibration for all tail labels; data augmentation via synonym substitution and back-translation for coverage-constrained labels would be the most direct remedy, as it would supply the additional positive evidence that the current training corpus alone cannot provide.
5.2. Computational Cost and Deployment Feasibility
The meta-update interval $M$ is selected at the Pareto-optimal point identified in Figure 7: it retains the peak Avg-F1 of more frequent updates at roughly one-third of their wall-clock cost, yielding only 14% overhead relative to standard BCE fine-tuning. This modest overhead reflects two design choices that together control cost: (i) meta-updates occur only once every $M$ gradient steps rather than at every step, and (ii) the meta batch size is kept small relative to the training batch. During inference, BML-Trans introduces no additional cost beyond the Multi-Scale Adapter (<2% parameter overhead): the label weight vector $\mathbf{w}$ is fixed after training and absorbed into the classification head, so runtime latency matches that of the MacBERT baseline apart from the adapter’s small overhead.
From a deployment perspective, BML-Trans operates on a 102-million-parameter PLM that fits within a single 24 GB GPU at batch size 32, making it compatible with standard enterprise GPU hardware without multi-GPU communication overhead. In a production legal AI pipeline, where element recognition typically runs as a microservice over batches of newly filed judgment documents, the 14% training-time overhead is a one-time cost per domain model and is amortized over the full deployment lifetime of the model. The architectural output of BML-Trans is identical to that of the MacBERT baseline, a sigmoid-activated K-dimensional score vector, allowing it to be integrated into existing downstream pipelines (charge prediction, case retrieval) without any interface changes.
5.3. Meta-Set Dynamics and Future Extension to Dynamic Updating
The current implementation constructs the boundary meta-set $\mathcal{D}_{\text{meta}}$ once at the end of the warm-up phase and holds it fixed throughout Phase 2. This static construction has a practical advantage in that it avoids the overhead of recomputing boundary scores during main training, but it may become suboptimal as the model’s uncertainty landscape shifts: sentences that are near the decision boundary at the end of warm-up may no longer be informative boundary examples several epochs later, potentially diluting the calibration signal. A curriculum-style extension that periodically re-scores and refreshes the meta-set at fixed-epoch intervals would test whether dynamic tracking of the shifting decision surface further concentrates the calibration signal and improves tail-label recovery. Preliminary analysis of the label-weight trajectory in Figure 4 suggests that tail-label weights stabilize within approximately 500 training steps, implying that a single refresh at the mid-point of Phase 2 might capture a substantial portion of the potential improvement without the full cost of continuous re-scoring.
5.4. Limitations and Broader Applicability
BML-Trans is applicable to tasks that simultaneously satisfy three conditions: a multi-label output space, a long-tailed label frequency distribution, and boundary instances identifiable from a warm-up model. Because the three scoring components (loss, uncertainty, and tail-trigger density) are computable from any warm-up model given only label-frequency statistics, the method requires no task-specific engineering beyond what is available during standard training, and analogous settings such as medical report labeling, news topic tagging, and patent claim categorization present the same structural conditions. Two limitations carry quantitative thresholds that define the operating range. For training sets with fewer than approximately 1000 labeled sentences per domain, or with fewer than ten positive instances for the rarest tail labels, the coverage constraint in boundary meta-set construction becomes infeasible at the default $m = 3$; reducing $m$ relaxes the constraint at the cost of a weaker calibration signal. All evaluations are conducted on CAIL2019, a Chinese-language legal benchmark, and transferability of the Tail Macro-F1 gains to other languages, legal systems, or long-tail multi-label domains requires further empirical validation.
6. Conclusions
Long-tail multi-label recognition of legal factual elements has resisted improvement from two established strategies: static reweighting cannot adapt as model confidence evolves, and sample-level meta-learning couples all co-occurring label gradients to a single scalar, leaving the tail labels unable to receive independent amplification. The experiments reported here identify label-wise gradient granularity and decision-boundary concentration of calibration signal as two jointly operative conditions that together enable the recovery of rare legal elements from imbalanced corpora. BML-Trans satisfies both conditions through a bilevel update on a per-label weight vector, guided by a meta-set concentrated on tail-label boundary sentences rather than easy negatives; the result is a Tail Macro-F1 improvement of up to 5.7 percentage points over the strongest static-reweighting baseline and of up to 3.6 percentage points over the strongest sample-level meta-learning baseline, at 14% additional training cost. The ablation cascade confirms that no single component achieves this margin alone: the Multi-Scale Adapter sharpens warm-up probability estimates, which raises boundary meta-set quality, which stabilizes the hypergradient direction for the label-wise weights, establishing that the gains are structural rather than incidental to threshold selection or random initialization.
For legal AI systems that must recover rare factual elements (custody arrangement, disputed compensation, employment duration) to support downstream judgment prediction and similar-case retrieval, this finding reframes the central engineering challenge. All baselines in our evaluation share the MacBERT-base encoder, yet BML-Trans achieves the reported gains without any increase in backbone parameters. The bottleneck is not encoder capacity but the granularity and targeting of gradient corrections: recovering one additional rare element from a judgment document may require not a larger model, but a more precisely targeted learning signal.
The specific structure of this work opens three near-term extensions. The cascade between warm-up representation quality and meta-set fidelity suggests that a single boundary meta-set, constructed once at warm-up, may not track the shifting decision surface throughout training; periodically re-scoring and refreshing the meta-set at fixed-epoch intervals would test whether curriculum-style dynamics can further concentrate the calibration signal as the model’s uncertainty landscape evolves. The label-wise weight vector is small and backbone-agnostic, making the bilevel update directly applicable to parameter-efficient fine-tuning of larger language models, a setting that shares the defining conditions of the current task: long-tail label distributions and scarce positive examples per rare label, at substantially greater encoder capacity. Transferability to other long-tail multi-label domains (medical report labeling, news topic tagging, patent claim categorization) and to other languages and legal systems remains the broadest open question, one whose answer would determine whether label-wise gradient control is a specialized technique for the legal domain or a broadly applicable technique for imbalanced multi-label learning. On the dataset dimension, evaluating BML-Trans on additional Chinese legal benchmarks such as CAIL2020, which extends the element schema and covers more recent judgments, would provide further evidence of generalizability within the legal domain and is a direct next step.