Overcoming Domain Shift in Violence Detection with Contrastive Consistency Learning

Zhenche Xia; Zhenhua Tan; Bin Zhang

doi:10.3390/bdcc9110286

,

and

¹

School of Software, Northeastern University, Shenyang 110819, China

²

National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110819, China

³

Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110819, China

^*

Author to whom correspondence should be addressed.

Big Data Cogn. Comput.2025, 9(11), 286;https://doi.org/10.3390/bdcc9110286

Version Notes

Order Reprints

Abstract

Automated violence detection in video surveillance is critical for public safety; however, existing methods frequently suffer notable performance degradation across diverse real-world scenarios due to domain shift. Substantial distributional discrepancies between source training data and target environments severely hinder model generalization, limiting practical deployment. To overcome this, we propose CoMT-VD, a new contrastive Mean Teacher-based violence detection model, engineered for enhanced adaptability in unseen target domains. CoMT-VD innovatively integrates a Mean Teacher architecture to adequately leverage unlabeled target domain data, fostering stable, domain-invariant feature representations by enforcing consistency regularization between student and teacher networks, crucial for bridging the domain gap. Furthermore, to mitigate supervisory noise from pseudo-labels and refine the feature space, CoMT-VD incorporates a dual-strategy contrastive learning module. DCL systematically refines features through intra-sample consistency, minimizing latent space distances for compact representations, and inter-sample consistency, maximizing feature dissimilarity across distinct categories to sharpen decision boundaries. This dual regularization purifies the learned feature space, boosting discriminativeness while mitigating noisy pseudo-labels. Broad evaluations on five benchmark datasets unequivocally demonstrate that CoMT-VD achieves the superior generalization performance (in the four integrated scenarios from five benchmark datasets, the improvements were 5.0∼12.0%, 6.0∼12.5%, 5.0∼11.2%, 5.0∼11.2%, and 6.3∼12.3%, respectively), marking a notable advancement towards robust and reliable real-world violence detection systems.

Keywords:

violence detection; domain shift; Mean Teacher

1. Introduction

Violence detection (VD), a subtask of human action recognition, addresses a research area of substantial practical importance [,,,,]. The development of robust violence detection systems is paramount for curbing the proliferation of harmful content on digital platforms, enabling the proactive monitoring of potential violent incidents in real-world environments, and facilitating preventive measures against such behaviors. Collectively, these applications notably contribute to safeguarding public safety and maintaining the integrity of both online and physical spaces.

Driven by notable advancements in artificial intelligence, methodologies for violent detection have undergone a paradigm shift, transitioning from traditional manual feature engineering approaches [,] to sophisticated deep learning-based architectures. Current research largely employs three main model categories: (a) hybrid frameworks combining two-dimensional convolutional neural networks (2D CNNs) with temporal modeling modules [,,] to capture both spatial and sequential dynamics; (b) end-to-end spatiotemporal representation learning via three-dimensional convolutional neural networks (3D CNNs) [,,]; and (c) more recently, attention-based architectures, particularly Transformer models [,], which excel at modeling long-range dependencies in video data. While these contemporary models often demonstrate strong performance in controlled experimental settings, their efficacy in practical, real-world deployments is frequently hindered by notable challenges, predominantly performance degradation when encountering unseen or dynamically changing environments.

Concretely, the reliable real-world deployment of these models is principally impeded by the domain shift [] phenomenon. This challenge stems from a fundamental distributional discrepancy between the data used for training and the data encountered during actual deployment [,,,,,,]. In violence detection, this manifests acutely due to the vast, unpredictable variability inherent in real-world scenarios, which datasets, by their nature, can only partially represent. For example, authentic violent incidents in the real world frequently occur under dynamically complex visual conditions such as occlusions (Figure 1a), where key elements may be obscured, or in low-illumination environments (Figure 1b), where visual clarity is compromised. These characteristics of target domain often differ qualitatively and in their range of variation from the training data subsets, which might include specific scenarios like actions against a pure-color background (Figure 1c) or with high motion blur (Figure 1d). Models trained on limited violence datasets may encounter unknown environmental changes when applied to real-world scenarios. This situation, where the data distribution used during training is inconsistent with the data encountered in practical applications, is a manifestation of domain shift [].

Figure 1. Examples of domain shift in violent detection. (a) Violent behavior with occlusion in real-world scenarios. (b) Violent behavior with low light in real-world scenarios. (c) Pure-color background. (d) High motion blur.

Such divergence is practically unavoidable, as real-world scenarios are inherently open-ended, encompassing countless variations in environmental factors, camera perspectives, and scene compositions that finite datasets cannot exhaustively capture [,,]. This inherent gap between training and real-world operational data critically exacerbates domain shift, profoundly undermining model generalization and leading to sharp performance declines in practical applications. This degradation carries severe security implications: systems may generate false alarms under benign environmental variations or, more critically, fail to detect genuine threats, potentially enabling exploitation by malicious actors. Addressing this pervasive domain shift is therefore paramount for developing reliable and robust VD technologies.

To mitigate this pervasive domain shift, we introduce CoMT-VD, a new domain adaptation framework specifically designed for violence detection. Our primary objective is to adequately transfer and adapt knowledge from labeled source domains to unlabeled target domains [,,], thereby notably enhancing generalization ability. CoMT-VD strategically leverages the Mean Teacher (MT) architecture [,], comprising the student

θ_{S}

network and teacher one

θ_{T}

with identical structures, where

θ_{T}

is updated via Exponential Moving Average (EMA) of

θ_{S}

. This design choice is fundamental because the EMA mechanism empowers the teacher network to generate highly stable and reliable pseudo-labels for sufficient unlabeled target data. These stable pseudo-labels, in conjunction with a consistency loss, are then critically utilized to guide

θ_{S}

training, compelling it to align its predictions with

θ_{T}

and thereby facilitating essential adaptation to diverse domain variations. While data augmentation [,,] further supports this adaptation by simulating real-world visual diversity, the inherent noise stemming from imperfect pseudo-labels generated by the MT setup necessitates a more robust feature refinement strategy. To address this crucial need and purify the learned feature space [,], we propose and integrate a dual-strategy contrastive learning (DCL) module. DCL meticulously refines feature representations through two complementary objectives. First, intra-sample consistency compels

θ_{S}

to produce highly similar features for differently augmented views of the same input. This ensures that learned features are robust to superficial visual changes and capture the essential, invariant characteristics of violence. Second, inter-sample consistency enhances semantic coherence. It actively maps features of samples with similar violent patterns closer together while maximizing distance for dissimilar categories, thereby sharpening decision boundaries. This synergistic integrated approach allows for the Mean Teacher framework to adequately drive domain adaptation [,,], simultaneously ensuring that DCL purifies the learned feature space by enforcing robustness to nuisance visual variations and notably enhancing categorical discriminability. This process adequately mitigates the detrimental impact of potentially noisy pseudo-labels and ensures that the student network

θ_{S}

learns highly discriminative yet domain-invariant features, ultimately leading to improved generalization in violence detection across shifting domains.

The main contributions of this paper are summarized as follows:

We present a pioneering investigation into the pervasive domain shift challenges in violence detection. To adequately address this, we propose CoMT-VD, a new contrastive Mean Teacher model specifically designed to enhance model adaptability and performance across diverse target domain distributions.
We introduce a new dual-strategy contrastive learning (DCL) module that integrates two distinct positive-negative pair matching strategies to compute complementary consistency losses. This promotes the learning of more effective discriminative features for violence detection under challenging domain shift conditions.
We conduct comprehensive evaluations of detecting violence under challenging domain shift scenarios, which unequivocally demonstrate consistent and notable performance improvements when CoMT-VD is integrated with various baseline models.

In the following sections, we will introduce current research on violence detection in the Related Work section, along with relevant studies involving domain adaptation, contrastive learning, knowledge distillation, and other related methodologies that this paper will address. In the Proposed Method section, we will provide a detailed explanation of the CoMT-VD training strategy we have developed. In the Experiments and Ablation Studies sections, we will demonstrate how the proposed CoMT-VD enhances model performance in dealing with domain shift issues. Finally, in the Conclusions section and Future Work sub-section, we will summarize the contributions of this paper and discuss the limitations of the current research.

3. Proposed Method

This section presents our proposed approach to violence detection, specifically designed to address key challenges such as domain shift. we introduce the Contrastive Mean Teacher Violence Detection (CoMT-VD) framework, detailing its overall architecture and elaborating on the motivation and design of its core components: the teacher-student paradigm, a dual-strategy contrastive learning mechanism, and the Mean Teacher optimization procedure.

3.1. Problem Definition

Let

X

denote the input space, representing video segments. Each video segment can be associated with a label from the label space

Y = {0, 1}

, where

y = 1

signifies the presence of violence and

y = 0

signifies its absence. The source domain

D_{s o u}

, consists of

| D_{s o u} |

samples and is formally represented as

{(x_{i}^{s o u}, y_{i}^{s o u})}_{i = 1}^{N_{s o u}}

, where

x_{i}^{s o u} \in X

is the i-th video segment from the source domain and

y_{i}^{s o u} \in Y

is its corresponding ground-truth label. In addition, we have access to a target domain, denoted as

D_{t a r}

, with

| D_{t a r} |

samples, represented as

D_{t a r} = {x_{j}^{t a r}}_{j = 1}^{N_{t a r}}

, where

x_{j}^{t a r} \in X

is the j-th video segment from the target domain. Crucially, the labels for the target domain data are often unavailable or only sparsely available, posing a notable challenge.

The core problem addressed is that of cross-domain violence detection. We aim to learn a model

f : X \to Y

, which generalizes to the target domain, even when there is a domain shift between the source

D_{s o u}

and target distribution

D_{t a r}

. The objective is to optimize f to learn domain-invariant features that are also highly discriminative for the violence detection task. Our approach seeks to achieve this by adequately leveraging any available unlabeled target data and by enhancing the model’s capacity to learn from augmented counterparts of the input data, thereby bridging the gap between domains.

3.2. Contrastive Mean Teacher Violence Detection

To achieve this, we propose the Contrastive Mean Teacher Violence Detection (CoMT-VD) model, which synergistically combines semi-supervised learning via the Mean Teacher paradigm with contrastive learning’s robust representation capabilities, further refined using a cross-domain augmentation strategy.

3.2.1. Method Overview

The CoMT-VD model, depicted in Figure 2, leverages a teacher–student framework [], where both teacher and student models, w.r.t.

f_{T}

,

f_{S}

, parameterized by

θ_{T}

and

θ_{S}

, shares the same network architecture. One critical aspect of our CoMT-VD is the use of cross-domain augmentation: the student model is fed data (from source and target domains) processed by the strong augmenter

A_{s t r o n g}

, while the teacher model receives data processed by the weak augmenter

A_{w e a k}

. As shown in Figure 2, labeled source domain

D_{s o u}

data undergoes the strong augmenter

A_{s t r o n g}

to

{{\bar{x}}_{k}^{s o u}}_{k = 1}^{K}

and is fed to the student network

θ_{S}

for supervised learning via

L_{\sup}

. Unlabeled target domain

D_{t a r}

data is processed through the weak augmenter

A_{w e a k}

to

{{\tilde{x}}_{k}^{t a r}}_{k = 1}^{K}

for the teacher

θ_{T}

and the strong augmenter to

{{\tilde{x}}_{k}^{t a r}}_{k = 1}^{K}

for the student. The teacher, updated via exponential moving average (EMA), provides pseudo-labels from

{{\tilde{x}}_{k}^{t a r}}_{k = 1}^{K}

to guide

θ_{S}

through a self-supervised consistency loss

L_{self}

. The strong-augmented target samples

{{\bar{x}}_{k}^{t a r}}_{k = 1}^{K}

are also input to

θ_{S}

to extract features for the dual-strategy contrastive learning (DCL) module, which applies a contrastive loss

L_{con}

to learn invariant representations. The framework is optimized with a composite loss with

L_{\sup}

,

L_{self}

, and

L_{con}

, enabling effective cross-domain violence detection.

Figure 2. Overview of the proposed CoMT-VD framework. To enable the model to tackle the domain shift problem, we fine-tuned the pre-trained model using the Mean Teacher framework. At the same time, to ensure that the student network can more effectively leverage the knowledge learned by the teacher network, we introduced a contrastive learning module during the fine-tuning process. Furthermore, to enhance the reliability of the knowledge acquired by the contrastive learning module, we designed a novel DCL contrastive learning approach. The blue line in the figure represents the data flow direction of the self-supervised branch, and the red part shows the data flow direction of the supervised branch.

The student model

f_{S}

learns by minimizing a supervised loss on strongly augmented labeled source data. Furthermore, a consistency loss is enforced between the student’s predictions on strongly augmented data and the teacher’s predictions on weakly augmented data, applied to samples from both domains. This encourages the student to learn representations invariant to severe perturbations while being guided by more stable targets from the teacher

f_{T}

. CoMT-VD integrates a dual-strategy contrastive learning (DCL) module that leverages student-extracted features (from strong augmentations) and teacher-extracted features (from weak augmentations) to foster domain-invariant yet discriminative representations. The teacher model updates via exponential moving average (EMA), ensuring training stability and continuous knowledge distillation. This unified framework effectively mitigates cross-domain distribution discrepancies, yielding a violence detection system robust to domain shifts.

3.2.2. Teacher–Student Framework

The student model

f_{S} (\cdot; θ_{S})

learns through direct back-propagation using supervised loss of

L_{s u p}

. In contrast, the parameter of the teacher model

θ_{T}

are updated as EMA of the student

θ_{T}

, creating a temporally ensembled, more stable version of the student. The key distinction of CoMT-VD is how data is presented to each model. The student model is trained on aggressively augmented data to learn robust features. In addition, the teacher model, receiving mildly augmented data, provides more consistent and reliable supervisory signals, especially for unlabeled target data and for regularizing the learning process of the student branch.

The student model learns from labeled source data

x^{s o u}

via a supervised classification loss

L_{s u p}

(e.g., binary cross-entropy), applied to strongly augmented source samples:

L_{\sup} = \frac{1}{| D_{s o u} |} \sum_{x^{t a r} \in D_{s o u}} [y^{s o u} log g (f_{S} (A_{s t r o n g} (x^{s o u})))] .

(1)

Here,

{\bar{x}}_{k, i}^{s o u} = A_{s t r o n g} (x_{i}^{s o u})

is k-th augmentation from the strong augmenter

A_{s t r o n g}

of i-th sample the source domain. g is the detection head.

Additionally, a consistency loss

L_{c o n s}

(e.g., Mean Squared Error) is applied between the predictions of the student on strongly augmented inputs and the teacher on weakly augmented inputs. This loss is computed for all samples

x_{i}

in a batch, which can include data

D_{t a r}

:

L_{self} = \frac{1}{| D_{t a r} |} \sum_{x^{t a r} \in D_{t a r}} ∥ f_{S} (A_{s t r o n g} (x^{t a r})) - f_{T} (A_{w e a k} (x^{t a r})) ∥_{2}^{2},

(2)

where B is the batch size. This forces the student to produce consistent predictions, even under strong perturbations, aligning with the teacher’s more stable view of the data.

3.2.3. Cross-Domain Augmentation

Integral to the efficacy and robustness of our CoMT-VD framework is a carefully designed cross-domain augmentation pipeline. Far from being a mere data preprocessing step, it serves as a strategic component engineered to (1) fortify the student network

θ_{S}

against challenging domain shifts, (2) provide diverse input views essential for dual-strategy contrastive learning to acquire invariant representations, and (3) ensure the stability of supervisory signals generated by the teacher network

θ_{T}

. To achieve these objectives, we divide the augmentation process into two distinct modalities: a weak augmenter

A_{w e a k}

, which preserves semantic content to support reliable pseudo-labeling by the teacher, and a strong augmenter

A_{s t r o n g}

, which introduces substantial visual variation to enhance the student’s resilience and promote distinctive feature learning.

Weak Augmenter: This is denoted as the operator

A_{w e a k}

. Its core tenet is the generation of K minimally perturbed view

{{\tilde{x}}_{k}}_{k = 1}^{K} = A_{w e a k} (x)

, which retains high semantic fidelity to the original sample. This conservative approach is paramount because

\tilde{x}

serves as the input to the teacher network

θ_{T}

, whose outputs (i.e., pseudo-labels) should be stable and reliable to guide the student’s optimization on unlabeled data. By minimizing aggressive counterparts,

A_{w e a k}

ensures that the teacher’s predictions are not confounded by augmentation-induced artifacts, thereby fostering a more dependable knowledge transfer. We note that

A_{w e a k}

concretely includes the following:

Minor Occlusion: Small, randomly positioned occluding patches are introduced. Given an input $x$ and a binary mask $m_{τ_{w}}$ defining these minor occlusions [,], the augmented sample is $\tilde{x} = x ⊙ (1 - m_{τ_{w}})$ , where $m_{τ_{w}} \sim Bernoulli (τ_{w})$ , and $τ_{w}$ is set $0.1$ in our experiment.
Subtle Brightness Adjustment: Pixel intensities are moderately modulated, for example, $\tilde{x} = γ_{w} \cdot x$ , where the brightness factor $γ_{w}$ is sampled from a narrow interval of $[0.9, 1.1]$ .
Gentle Frame Blending: Subtle temporal alterations or mild blending with non-violent frames $x_{n v}$ are performed using a minimal blending coefficient $λ_{w} = 0.1$ , ensuring that the dominant pixel remains unaltered: $\tilde{x} = (1 - γ_{w}) x + γ_{w} x_{n v}$ .

Given the target domain sample

x^{t a r}

, the corresponding weak-augmented results

{{\tilde{x}}_{k}^{t a r}}_{k = 1}^{K}

can be obtained:

{\tilde{x}}_{k}^{t a r} = A_{w e a k} (x^{t a r}), k = 1, 2, \dots, K .

(3)

Then,

{{\tilde{x}}_{k}^{s o u}}_{k = 1}^{K}

will be input into the teacher network

θ_{T}

, to generate high-quality pseudo-labels, which is fundamental for the student’s robust adaptation to the target domain.

Strong Augmenter. In contrast, the strong augmenter

A_{s t r o n g}

is designed to generate substantially diversified and challenging views of the input data. It serves two crucial purposes: (1) It is applied to labeled source samples

x^{s o u} \in D_{s o u}

to train a student

θ_{S}

that is robust to various visual perturbations using self-supervised learning in Equation (2). (2) It is applied to unlabeled target samples

x^{t a r} \in D_{t a r}

to produce K augmentations. These strong-augmented target data are then processed by the student network

θ_{S}

and are the primary inputs for DCL module, compelling it to learn features that are invariant to drastic appearance changes, yet retain semantic discriminability. Concretely,

A_{s t r o n g}

is characterized by significant intensity, broader parameter ranges, as follows:

Significant Occlusion: Larger or more strategically disruptive occlusion masks $m_{τ_{s}}$ are employed: $\bar{x} = x ⊙ (1 - m_{τ_{s}})$ , with $m_{τ_{s}} \sim Bernoulli (τ_{s})$ and $τ_{s} = 0.3$ .
Major Brightness and Contrast Shifts: Pixel intensities and contrast are altered dramatically. For instance, brightness [,] might be scaled by $\bar{x} = γ_{s} \cdot x$ , where $γ_{s}$ is sampled from a wider range $[0.5, 1.5]$ , and contrast adjustments are similarly intensified to simulate challenging real-world lighting conditions (i.e., very dark or overexposed scenes).
Aggressive Frame Blending and Temporal Manipulation: More profound temporal alterations, such as significant frame shuffling [], or aggressive blending [] with disparate scenes $x_{n v}$ (including non-violent content or noise) using a substantial blending factor $λ_{s}$ : $\bar{x} = (1 - λ_{s}) x + λ_{s} x_{n v}$ , where $λ_{s} = 0.4$ .

The rationale for the strong augmenter is to construct a challenging learning crucible for

θ_{S}

. We note that the strong augmenter

A_{s t r o n g}

receives the samples

x^{t a r}

and

x^{s o u}

from both target domain and the source domain, and yields the following conterparts

{{\bar{x}}^{t a r}}_{k = 1}^{K}

and

{{\bar{x}}^{s o u}}_{k = 1}^{K}

:

\begin{matrix} {\bar{x}}_{k}^{t a r} & = A_{s t r o n g} (x^{t a r}), k = 1, 2, \dots, K, \end{matrix}

(4)

\begin{matrix} {\bar{x}}_{k}^{s o u} & = A_{s t r o n g} (x^{s o u}), k = 1, 2, \dots, K . \end{matrix}

(5)

By forcing the student to discern invariant characteristics across these radically different views of the same underlying instance (especially for target data in the DCL module) and across diverse source instances, we cultivate feature representations that are not only robust to superficial visual changes, but are also highly discriminative of the core semantic content pertaining to violence. This is indispensable for overcoming domain-specific idiosyncrasies and achieving superior generalization.

This strong-weak augmentation strategy [] establishes a symbiotic interaction within CoMT-VD, effectively addressing domain shift through reliable supervision from the Mean Teacher and DCL’s refinement of discriminative, domain-agnostic features. This synergy forms the foundation of our framework’s enhanced adaptability and performance across diverse violence detection scenarios.

3.2.4. Dual-Strategy Contrastive Learning

To align samples from different distributions, we propose a new contrastive learning mechanism with two key components: Intra-Sample Consistency: Treat a sample and its differently augmented versions as a positive pair, while considering all other samples as negative pairs. Inter-Sample Consistency: To identify positive pairs of the same class from different samples within a batch, we regard samples with similarity scores exceeding

δ^{u p p e r}

as positive pairs. To better distinguish samples from different classes, samples with similarity scores below

δ^{l o w e r}

within the same batch are treated as negative pairs. In this Figure,

{\hat{z}}_{k}^{i}

is the feature map obtained after sample

x_{k}^{i}

undergoes the strong augmenter and is fed into the encoder in the student network;

{\bar{z}}_{k}^{i}

is the feature map obtained after sample

x_{k}^{i}

undergoes weak augmentation and is fed into the teacher encoder in the teacher network. Here, K represents the number of augmentations. Contrastive learning enables the unsupervised approach of pulling closer the distance between samples of the same category in the input space while repelling samples from different categories [,]. While this mechanism is commonly applied in few-shot learning, our work draws inspiration from the core principle of contrastive learning: pulling similar samples closer and pushing dissimilar ones apart in the feature space. As illustrated in Figure 2, we propose a dual positive-negative pair matching strategy to compute the NT-Xent loss (Normalized Temperature-scaled Cross Entropy) [,] for both positive and negative sample pairs. We call the module dual-strategy contrastive learning (DCL). The input contains

2 K

augmented samples by the weak augmenter and strong augmenter. These samples first undergo positive-negative pair matching via the Pair Matcher in DCL, followed by the computation of the NT-Xent loss using these pairs to achieve cross-domain sample consistency alignment. The Pair Matcher employs two distinct positive-negative pair matching methods to perform consistency alignment:

Intra-Sample Consistency: To enable the model to maintain feature invariance of the same sample across different scenarios, we construct positive pairs by pairing the augmented features of a sample with those extracted from other augmentation methods applied to the same sample, while treating features from other samples as negative pairs. This ensures that the model preserves consistency in extracted features across varying environments while enhancing its robustness to noise and perturbations.
Inter-Sample Consistency: Normal samples consistency establishes positive pairs by matching a sample with its differently augmented views, while treating samples from other instances as negatives—a classic positive-negative pairing strategy. This encourages the network to preserve feature consistency across diverse transformations and real-world variations. However, relying solely on this approach may under-utilize valuable semantic relationships among distinct instances that share the same class label. To fully exploit the diverse information emerging from different views generated by the same sample within a batch, we design an intra-sample feature pairing strategy.

To address this limitation, we introduce Inter-Sample Consistency, which aligns feature representations of different samples belonging to the same class. This not only improves the utilization of intra-class diversity, but also promotes better alignment of heterogeneous yet semantically similar instances, thereby enhancing the compactness and discriminability of action-level features.

As show in Figure 3, we input K strong augmentations

{{\bar{x}}^{t a r}}_{k = 1}^{K}

from target domain

D_{t a r}

into the DCL, where each sample

{\bar{x}}_{k}^{t a r}

is a strong augmentation of target domain using

A_{s t r o n g}

. Also, K weak augmentations

{{\tilde{x}}^{t a r}}_{k = 1}^{K}

from

A_{w e a k}

are input. Then, we encode each

{\bar{x}}_{k}^{t a r}

by student network

f_{S} (\cdot; θ_{S})

, and encode each

{\tilde{x}}_{k}^{t a r}

by teacher network

f_{T} (\cdot; θ_{T})

to obtain the feature set

\bar{Z} = {\bar{z}}_{k = 1}^{K}

and

\tilde{Z} = {\tilde{z}}_{k = 1}^{K}

. We note that each feature

{\bar{z}}_{k}

,

{\tilde{z}}_{k}

will be used to compute the similarity within the corresponding feature set. Taking a strong augmentation sample pair

{\bar{z}}_{m}

and

{\bar{z}}_{n}

as an example, the similarity is calculated by cosine similarity:

c o s_s i m ({\bar{z}}_{m}, {\bar{z}}_{n}) = \frac{{\bar{z}}_{m} \cdot {\bar{z}}_{n}}{∥ {\bar{z}}_{m} ∥ ∥ {\bar{z}}_{n} ∥} .

(6)

If the

c o s_s i m ({\bar{z}}_{m}^{i}, {\bar{z}}_{n}^{j}) > δ^{u p p e r}

, we regard

{\bar{z}}_{m}

and

{\bar{z}}_{n}

as a positive pair. In contrast, if

c o s_s i m ({\bar{z}}_{m}, {\bar{z}}_{n}) < δ^{l o w e r}

, the

{\bar{z}}_{m}

and

{\bar{z}}_{n}

are treated as a negative pair. In our experiments,

δ^{u p p e r} = 0.7

is the positive critical threshold and

δ^{l o w e r} = 0.3

is the negative critical threshold. In the dual-strategy contrastive learning module, for a batch of size K, the teacher network generates K weak augmented samples and the student network generates K strong augmented samples in the unsupervised branch. These 2K samples are then paired as positive and negative sample pairs. All other sample features are disregarded, and we set these similarities to 0. For an augmented feature

z_{k}^{'} \in \bar{z} \cup \tilde{Z}

derived from a sample, its positive pair set is denoted as

P (z_{k}^{'})

. We use the NT-Xent loss [,] to compute the contrastive loss

L_{con}

:

L_{con} = - \frac{1}{2 K} \sum_{k = 1}^{2 K} log \frac{\sum_{z_{p} \in P (z_{k}^{'})} exp (c o s_s i m (z_{k}^{'}, z_{p}) / τ)}{\sum_{z_{q} \notin P (z_{k}^{'})} exp (c o s_s i m (z_{k}^{'}, z_{q}) / τ)},

(7)

where

τ

is the temperature scaling factor. We are consistent with the best performance in [], and set

τ

= 0.1.

P (z_{k}^{'})

represent positive pair sets for the

z_{k}^{'}

, and

z_{p}

represents the positive feature for

z_{k}^{'}

in the

P (z_{k}^{'})

, and

z_{q}

represents the negative sample of

P (z_{k}^{'})

, which is not in P and have a similarity

δ^{l o w e r} < 0.3

with

z_{k}^{'}

.

Figure 3. Illustration of dual-strategy contrastive learning (DCL). To align samples from different distributions, we propose a new contrastive learning mechanism with two key components: Intra-Sample Consistency and Inter-Sample Consistency.

3.2.5. Mean Teacher Optimization

The Mean Teacher architecture enhances semi-supervised learning by leveraging the stability of the teacher model

θ_{T}

to reduce noise in pseudo-labels []. To ensure that the teacher network acquires richer knowledge while maintaining stability, the weakly augmented samples

{\tilde{x}}_{k}^{t a r} = A_{w e a k (x^{t a r})}

are fed into the teacher branch. This approach enables the model to learn sample features across diverse scenarios while preserving the teacher model’s stability. As a result, the teacher generates more reliable pseudo-labels to guide the parameters updating of the student network

θ_{S}

. The parameters

θ_{S}

of student network

f_{S} (\cdot; θ_{S})

is updated using gradient descent:

θ_{S} \leftarrow θ_{S} - α \cdot \nabla_{θ_{S}} L_{CoMT},

(8)

where

α = 0.001

is the learning rate. In addition,

L_{CoMT}

is the total loss function of the CoMT-VD framework, which is formulated as the combination of three components, i.e., the supervised loss

L_{\sup}

, self-supervised loss

L_{self}

, and contrastive loss

L_{con}

in Equation (7), expressed as

L_{CoMT} = λ_{1} L_{\sup} + λ_{2} L_{self} + λ_{3} L_{con},

(9)

where

λ_{1} = 1

,

λ_{2} = 0.5

, and

λ_{3} = 0.3

are the weights of these three loss terms.

Meanwhile, the parameters

θ_{T}

of the teacher branch are updated via

θ_{S}

by Exponential Moving Average (EMA):

θ_{T} \leftarrow η θ_{T} + (1 - η) θ_{S} .

(10)

Here,

η

is the momentum factor, which is recommended to set to

0.95

during deployment. The overall training procedure of the proposed CoMT-VD (Contrastive Mean Teacher for Violence Detection) model is outlined in Algorithm 1.

Algorithm 1 Training Procedure of CoMT-VD

Require: Labeled source data $D_{s o u} = {(x_{i}^{s o u}, y_{i}^{s o u})}$ ; Unlabeled target data $D_{t a r} = {x^{t a r}}$ ; Pre-trained model $f_{θ}$ ; Output: Optimized teacher $f_{θ_{T}}$ and student model $f_{θ_{S}}$ ;

1. Initialize the student network

f_{θ_{S}} \leftarrow f_{θ}

and the teacher network

f_{θ_{T}} \leftarrow f_{θ}

;

2. For each

x_{i}^{s o u} \in D_{s o u}

and

x_{i}^{t a r} \in D_{t a r}

do:

⊳ generate the strong augmentations

3.

{{\bar{x}}_{k}^{s o u}}_{k = 1}^{K} = A_{s t r o n g} (x^{s o u})

;

4.

{{\bar{x}}_{k}^{t a r}}_{k = 1}^{K} = A_{s t r o n g} (x^{t a r})

;

▹ generate the weak augmentations

5.

{{\tilde{x}}_{k}^{t a r}}_{k = 1}^{K} = A_{w e a k} (x^{t a r})

;

⊳ compute the feature of the strong augmented samples

6.

{\bar{z}}_{k}^{K} = f_{θ_{T}} ({{\bar{x}}_{k}^{t a r}}_{k = 1}^{K})

and

{\tilde{z}}_{k}^{K} = f_{θ_{T}} ({{\tilde{x}}_{k}^{t a r}}_{k = 1}^{K})

;

7. Calculate

L_{\sup}

using Equation (1);

L_{\sup} = \frac{1}{| D_{s o u} |} \sum_{x^{t a r} \in D_{s o u}} [y^{s o u} log g (f_{S} (A_{s t r o n g} (x^{s o u})))]

;

8. Calculate

L_{self}

via Equation (2);

9. Calculate

L_{con}

via Equation (7);

10. Calculate

L_{CoMT} = L_{\sup} + L_{self} + L_{con}

;

11. Update

θ_{S}

using Equation (8);

12. Update

θ_{T}

using Equation (10);

13. End For

4. Experiments

We implement our proposed model based on TensorFlow 2.3.0, and all experiments are deployed on a machine with a 24 GB NVIDIA A10 GPU. In this section, we firstly introduce our experimental datasets, then we will present the experimental setup and results of the model. In this paper, all of the experiments adopted the classic frame sampling method in the violence detection [,,,], sampling 16 frames uniformly from each video. In this paper, all training runs were conducted with the random seed fixed at 42, and the best-performing model was selected after 200 epochs of training. Following this, we conduct a further ablation analysis, and, finally, we analyze our proposed model based on the experimental results.

4.1. Benchmark Datasets

In this paper, we verify the performance of our proposed model on five well-known datasets:

RLVS [] is a large-scale real-world captured dataset, containing 1000 violent and 1000 non-violent video clips. The video data in RLVS features more diverse scenarios including streets, classrooms, courtyards, corridors, and sports fields. Most footage is captured from a third-person perspective, predominantly showing fights involving multiple individuals, with a small number of group violence scenes.
RWF-2000 [] is designed for real-world violent behavior detection under surveillance cameras. The videos in this dataset are collected from raw surveillance footage on YouTube, segmented into clips of up to 5 s at 30 frames per second, with each clip labeled as either violent or non-violent behavior.
Hockey Fight [] comprises 1000 real video clips from ice hockey matches, evenly split between 500 fight scenes and 500 normal segments in the hockey games. However, a notable limitation of this dataset is the lack of scene diversity, as all videos are confined to ice hockey rinks during matches.
Movies [] is relatively small-scale, consisting of 200 video clips from action movies, with 100 clips depicting violent scenes and 100 non-violent scenes.
Violent Flow [] describes violent or non-violent group behaviors in real-world scenarios. The samples originates from surveillance cameras and monitoring devices, capturing large-scale crowd behaviors in public spaces such as stadiums, streets, and squares. The videos are divided into two categories, violent video and normal video, with a total of 246 video clips (123 violent and 123 non-violent clips).

4.2. Baselines

Our CoMT-VD framework can be integrated with any existing violence detection models. Therefore, to validate the generalization capability of our proposed CoMT-VD, we conduct experiments on 5 different models categorized into three groups based on their architectures: (1) 3DCNN-based: Song et al. [] extends the C3D network proposed in [] (originally designed for human action recognition) to the task of violent behavior recognition. Zhenhua et al. [] proposes full-temporal fusion violence detection model (SCTF). (2) 2DCNN-based: Soliman et al. [] uses 2DCNN to extract spatial features from each frame and employs an LSTM network for temporal feature fusion, called VGG16 + LSTM. Wang et al. [] proposes a plug-and-play module (ActionNet) for human action recognition networks, which is also used to the violence detection task. (3) Transformer-based: Arnab et al. [] introduces a pure Transformer encoder for video understanding, replacing conventional 3D convolutions in action recognition tasks.

4.3. Experimental Setup

To comprehensively evaluate the efficacy of our proposed CoMT-VD framework, we designed two distinct experimental setups, each addressing a critical aspect of real-world violence detection challenges: handling unseen environmental conditions and generalizing to variational datasets. In all experiments, inputs consist of 16 video frames uniformly segmented from the video, with each frame resized to a resolution of

224 \times 224

.

(1) Setup-1: Performance under Simulated Variational Conditions: This setup is meticulously designed to assess CoMT-VD’s robustness and generalization abilities when confronted with data corrupted by various challenging, yet common, environmental factors not explicitly seen during standard training. Our objective is to simulate real-world unseen scenarios to validate the performance improvement of CoMT-VD on data exhibiting variational distributions. We introduce four challenging conditions, occlusion, fog, rain, and low-light environments, as shown in Figure 4.

Figure 4. Instances of domain shift under simulated real-world conditions: Occlusions, Rainy Scene, Foggy Scene, and low-illumination environments. These four examples are examples where the baseline model tends to yield detection errors. By combining CoMT VD, all models will obtain more accurate detection results.

Rain Simulation: Gaussian noise is strategically injected into the original images, with elongated noise points designed to simulate realistic raindrops. Raindrop parameters are precisely set to a length of 10 pixels and a count of 500 per frame, ensuring a consistent and reproducible simulation of rainy scenes [].
Fog Simulation: A fog-like color template [] (fixed at an RGB value of 200) is overlaid onto the original image at 50% intensity. This process emulates the reduced visibility and contrast of foggy conditions.
Low-Light Simulation: The brightness of each frame is systematically reduced to 40–70% of its original level. This simulates varying degrees of low-light conditions, a frequent challenge in surveillance and monitoring applications.
Occlusion Simulation: To mimic real-world occlusions [,], randomly positioned rectangular masks, with widths and heights sampled from the range of $[30, 70]$ pixels, are applied to each frame within the test videos. This simulates partial obstructions that can severely impact visual cues.

For this setup, five baseline models are evaluated across five benchmark datasets. For each dataset, 80% of the data is allocated for training and 20% for testing. Crucially, within our proposed CoMT-VD, the training set is further partitioned, with 20% designated as source domain data and 80% as target domain data. We conduct a rigorous comparative analysis of baseline model both before and after the integration of CoMT-VD to empirically validate its effectiveness in mitigating the impact of these unseen conditions.

(2) Setup-2: Generalization to Variational Datasets: This experimental setup focuses on evaluating the cross-scenario generalization efficacy. Here, the five baselines are initially trained on widely recognized real-world violence detection datasets, namely RLVS [] and RWF-2000 []. Subsequently, their performance is rigorously evaluated on distinct, scenario-specific benchmarks: Hockey Fight [] and Movies []. To ensure comparative validity and a fair assessment, baseline counterparts augmented with our proposed training strategy undergo identical evaluation protocols. This testing paradigm quantitatively assesses the detection accuracy improvements, thereby systematically validating the methodology’s ability to generalize adequately to entirely variational datasets, a critical requirement for real-world applicability.

In all experiments presented in this paper, the input consists of 16 video frames uniformly segmented from the video, with each frame resized to a resolution of

224 \times 224

.

4.4. Experimental Results

4.4.1. Performance for Unseen Conditions

In real-world scenarios (such as the video obtained from security cameras without modification by multimedia technology), there are often uncertain dynamic environmental changes (e.g., occlusion, low illumination, rainy, foggy weather, etc.) that involve unseen data not encountered during training, leading to performance degradation of models. This experiment aims to validate the capability of our proposed CoMT in handling such unseen datasets under environmental variations. We train five baseline models on five benchmark datasets, respectively, using 80% of the data for training and 20% for testing. The test data simulates challenging scenarios including rainy, foggy, low-light, and occluded conditions to evaluate the models’ generalization capability on unseen situations potentially encountered in real-world deployments. To validate the effectiveness of the Contrastive Mean Teacher (CoMT), we integrate CoMT into each of the five pre-trained baseline models for fine-tuning, subsequently testing their adaptability and robustness in previously unseen extreme scenarios. Table 1 presents the performance comparisons between the baseline models and those integrated with CoMT-VD across five benchmark datasets for rainy scenarios, foggy scenarios, low-light scenarios, and occlusion scenarios, respectively. It can be observed that the incorporation of CoMT-VD leads to a notable improvement in the model’s performance on unseen scenarios. We conduct the experiment in the settings consistent with Setup-1. The results show that the introduction of CoMT-VD can enhance the generalization capability of the models, improving its detection performance on unseen data. The baseline models achieve improvements ranging from 5.0∼12.0%, 6.0∼12.5%, 5.0∼11.2%, 5.0∼11.2%, and 6.3∼12.3% across the five benchmark datasets in four different domain scenarios, respectively. We note that the results demonstrate that CoMT-VD consistently enhances the capability of various violence detection models for test samples with variational distributions.

Table 1. Volence detection performance (accuracy %) across five dataset under different domain shift scenarios: rainy, foggy, low-light, and occlusion scenes. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models.

4.4.2. Performance for Cross Datasets

To validate CoMT’s ability to enhance model generalization and improve domain adaptation capability, we designed this experiment to demonstrate that models trained on real-world datasets (RLVS, RWF-2000) show improved domain adaptation and generalization on specific-scenario violence detection datasets (Hockey Fight, Movies). The experimental setup during training remained consistent with Section 4.4.1, using real-world datasets RLVS and RWF individually, as well as a mixed dataset combining RLVS and RWF-2000 for training. Testing was conducted on the Hockey Fight and Movies datasets. The Hockey Fight dataset contains violent behaviors occurring in ice hockey arenas, where all subjects wear protective gear. While the dataset exhibits notable variations in scene configurations and subject appearances, the characteristic patterns of violent behaviors remain consistent. The Movies dataset contains numerous exaggerated violent actions from a movie perspective, which differs from the distribution of daily violent behaviors in the training data. As shown in Table 2 and Table 3, baseline models incorporating the CoMT strategy achieve remarkable performance improvements. The results indicate that these baselines can enhance cross-domain detection capability when trained with CoMT-VD, specifically improving the models trained on real-world data by 5.1∼9.2% on the Hockey Fight test set and 5.0∼12.0% on the test set of the Movies dataset. Experimental results demonstrate that incorporating CoMT adequately enhances the model’s generalization performance in cross-domain detection under such specialized scenarios

Table 2. Performance (accuracy %) of models trained on real-world datasets test on the Hockey Fight dataset under Setup-1. We employ real-world datasets (RLVS and RWF-2000, both of them come from surveillance videos, which include indoor/outdoor fights and daily violent behaviors) as training data and evaluated on the Hockey Fight dataset to validate the model’s generalization capability for cross-domain detection. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models.

Table 3. Performance of models trained on real-world datasets test on the Movies dataset under Setup-2. We employ real-world datasets (RLVS and RWF-2000, both of them come from surveillance videos, which include indoor/outdoor fights and daily violent behaviors) as training data and evaluated on the Movies dataset to validate the model’s generalization capability for cross-domain detection. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models.

5. Ablation Studies

Finally, we investigate the impact of different components on the generalization capability of the CoMT-VD model. When we set the brightness

γ_{s} = [0.5, 1.5]

γ_{w} = [0.9 - 1.1]

, occlusion

τ_{w} = 0.1

τ_{s} = 0.3

, Frame Blending

λ_{s} = 0.4

λ_{w} = 0.1

, and similarity threshold

δ^{u p p e r} = 0.7

δ^{l o w e r} = 0.3

parameters, we fine-tuned the model to achieve the best performance. In this section, we introduce the impact of different parameter settings and modules on the model.

5.1. Different Augmentation Intensities

This study employs three distinct data augmentation strategies to imbue training data with multi-scenario noise perturbations, thereby enhancing the model’s adaptability to diverse environmental disturbances. The experiments investigate how varying augmentation intensities affect the model’s detection generalization capability: (1) Brightness Augmentation: We optimize brightness augmentation factors

γ_{s}

and

γ_{w}

through systematic strength calibration to identify optimal parameter values. (2) Occlusion Augmentation: By controlling occlusion ratios

m_{τ_{s}}

and

m_{τ_{w}}

, selecting diverse occlusion box dimensions, we enhance the model’s noise resistance against partial occlusions. The numbers of pixels with occlusion are calculated by

⌊ m_{τ_{s}} ⊙ x ⌋ ⊙ ⌊ m_{τ_{s}} ⊙ x ⌋

,

⌊ m_{τ_{w}} ⊙ x ⌋ ⊙ ⌊ m_{τ_{w}} ⊙ x ⌋

, respectively, where

⌊ \cdot ⌋

represents rounding down to an integer. Ther weak augmenter randomly samples occlusion masks as integers within (0, [

⌊ m_{τ_{w}} ⊙ x ⌋

], while the strong augmentation selects integers from [

⌊ m_{τ_{w}} ⊙ x ⌋

,

m_{τ_{s}} ⊙ x ⌋

]. (3) Temporal Frame Blending: A frame blending rate parameter governs the insertion of adversarial frames into video sequences. Given input dimensions

224 \times 224 \times 16

, the number of blended frames is calculated as

⌊ λ_{n v} \times 16 ⌋

, strengthening temporal feature extraction capabilities. In this section, all experiments maintain consistency with Setup-2 in terms of scenario configurations and data usage. During the testing phase, we amalgamate the four scenarios defined in Setup-2 into composite test data for experimental validation.

Impact of Brightness Adjustment Factor. As demonstrated in Table 4, we explored varying random sampling ranges for the brightness augmentation factor. Optimal results are achieved by configuring strong augmentation with a factor randomly sampled from

[0.5, 1.5]

and weak augmentation from

[0.9, 1.1]

. This reveals two key observations: A clear distinction exists between the ranges of strong and weak brightne augmentation factors. Weak augmentation operates within a narrower, more stable range, whereas strong augmentation notably enhances the model’s generalization capability when employing a broader yet appropriately bounded random sampling interval. The empirical observations demonstrate that both excessive or insufficient degradation levels adversely impact model generalization capability. The bold font indicates the best performance achieved by CoMT-VD across all setups.

Table 4. Performance (accuracy %) with different brightness augmentations. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. The first column specifies different value combinations of illumination

γ_{s}

(strong augmentation) and

γ_{w}

(weak augmentation) in the illumination-reduction augmentation.

Impact of Occlusion Augmentation Rate. As shown in Table 5, the model achieves optimal performance when the occlusion ratio is set to 0.3 for strong augmentation and 0.1 for weak augmentation. Excessively high occlusion ratios (i.e.,

τ_{s} > 0.3

in strong augmentation) may excessively mask critical features during training, impairing feature extraction capabilities. Conversely, overly conservative ratios (e.g., lowering strong augmentation below

τ_{s} < 0.3

) risk oversimplifying learned representations, thereby degrading generalization ability. Crucially, insufficient differentiation between strong and weak augmentation parameters (e.g.,

0.2

for weak vs.

0.3

for strong and

0.1

for weak vs.

0.2

) leads to dual failures: inadequate knowledge acquisition due to overlapping perturbation intensities and diminished pseudo-label reliability caused by weakened data credibility in weakly augmented samples.

Table 5. Performance (accuracy %) with different occlussion augmentation rates in the strong and weak augmenter. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. This experiment investigates the impact of occlusion size on model generalization capability. The numbers of pixels with occlusion are calculated by

⌊ m_{τ_{s}} ⊙ x ⌋ ⊙ ⌊ m_{τ_{s}} ⊙ x ⌋

,

\begin{matrix} b r e a k \end{matrix}

⌊ m_{τ_{w}} ⊙ x ⌋ ⊙ ⌊ m_{τ_{w}} ⊙ x ⌋

respectively, where

⌊ \cdot ⌋

represents rounding down to an integer.

Impact of Temporal Frame Blending Rate: This experiment investigates the impact of introducing a cross-modal data mixing strategy within the Mean Teacher framework. In Table 6, by comparing the effects of inserting non-violent frames at different ratios in the strong/weak augmentation branches (

λ_{s} = 0.3

for strong augmentation and

λ_{w} = 0.1

for weak augmentation), we observe that introducing 30% non-violent frames in the strong augmentation branch achieves optimal validation performance. This demonstrates the strategy’s effectiveness in enhancing the model’s discriminative capability against adversarial frames. The high-disturbance features generated by strong augmentation, combined with the semantic interference from non-violent frames, create an adversarial training paradigm that compels the network to prioritize temporal motion features over individual frame appearances. Meanwhile, maintaining a lower 10% mixing ratio in the weak augmentation branch preserves the reliability of pseudo-labels generated by the teacher network. It can be observed that excessive blending frames rate may lead to the replacement of key feature frames, which could degrade model performance. Insufficient distinction between strong and weak augmentations might result in inadequate knowledge acquisition by the model, while unstable weak augmentations could compromise pseudo-label quality and further impair model performance. In this study, we select the parameter combination of and

λ_{s} = 0.4

and

λ_{w} = 0.1

.

Table 6. Performance (%) with different frame blending rate. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. The number of blended frames is calculated as

⌊ λ_{n v} \times 16 ⌋

, strengthening temporal feature extraction capabilities.

5.2. The Impact of Other Factors on Performance

Impact of Mean Teacher and DCL in CoMT-VD: In this experiment, the encoder employs two baselines: the top-performing SCTF and the classic 3DCNN-based model C3D. We validate the generalization of CoMT-VD on unseen data across five benchmark datasets, specifically the models are tested on the Hockey Fight dataset, to demonstrate CoMT-VD’s capability. Following the experimental setups in Section 4.4.1, we evaluate the impact of CoMT-VD’s components by disabling the Mean Teacher strategy and DCL strategy separately across different baselines. Table 7, Table 8, Table 9 and Table 10 demonstrate that incorporating the Mean Teacher and dual-strategy contrastive learning (DCL) modules can enhance generalization by improving the effectiveness of training samples. While DCL alone—without the Mean Teacher framework—yields moderate gains in generalization, the combination of both components achieves the best performance, highlighting their complementary roles in facilitating cross-domain knowledge transfer. In this experiment, we aggregate unseen data from diverse categories across each dataset into the test set to evaluate the model’s out-of-distribution performance. Figure 5 illustrates the ROC curves of C3D and SCTF in cross-dataset violence detection after training on three source datasets with CoMT-VD. The results show that CoMT-VD improves the Area Under the ROC Curve (AUC), indicating enhanced model performance in balancing true positive and false positive rates when detecting violent behavior.

Table 7. Impact of accuracy by Mean Teacher and DCL for cross-dataset detection. To verify the contributions of both modules, we conduct experiments by selectively disabling each component through setting their corresponding loss terms to zero, thereby validating the effectiveness of both components in CoMT-VD. These experiments are performed by training on three datasets (RLVS, RWF-2000, and RLVS + RWF-2000) and testing on the Hockey Fight dataset, aiming to demonstrate CoMT-VD’s capability in addressing domain shift challenges for cross-dataset detection tasks. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models. × indicates that the corresponding module has been included, while √ indicates that the corresponding module has not been included.

Table 8. Impact of F1 score by Mean Teacher and DCL for cross-dataset detection. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models. × indicates that the corresponding module has been included, while √ indicates that the corresponding module has not been included.

Table 9. Validation of the Impact of accuracy by Mean Teacher and DCL on the Models for unseen scenarios dataset violent video detection. We conduct experiments by selectively disabling each component through setting their corresponding loss terms to zero, thereby validating the effectiveness of both components in CoMT-VD. These experiments are performed by training on five benchmark datasets (RLVS, RWF-2000, Hockey Fight, Movies and Violent) respectively, and test the performance on each simulated scenarios of test set to demonstrate CoMT-VD’s capability in addressing domain shift challenges for unseen conditions dataset detection tasks. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models. × indicates that the corresponding module has been included, while √ indicates that the corresponding module has not been included.

Table 10. Validation of the Impact of F1 score by Mean Teacher and DCL on the Models for unseen scenarios dataset violent video detection. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models. × indicates that the corresponding module has been included, while √ indicates that the corresponding module has not been included.

Figure 5. ROC curves of C3D and SCTF on cross-dataset voilence detection. The sub-figure (a–c) show the changes in ROC curves of C3D and SCTF models tested on the Hockey Fight dataset before and after incorporating the proposed CoMT-VD framework. The results demonstrate that with CoMT-VD, the AUC values are improved for all models.

Impact of the threshold in DCL: Within the Inter-sample consistency in the proposed DCL, our pair-matching mechanism for positive/negative samples considers feature pairs with similarities above threshold

δ^{u p p e r}

as positive matches (features from different samples but belonging to the same class), and those below threshold

δ^{l o w e r}

as negative matches (features from different samples with distinct classes). This approach ensures that pseudo-labels generated in unlabeled self-supervised learning better approximate the true class distribution. The selection of these pair-matching thresholds becomes crucial. The results in Table 11 indicate that the optimal performance is achieved when

δ^{u p p e r}

= 0.7 and

δ^{l o w e r}

= 0.3. Excessively strict constraints on inter-sample similarity matching can lead to insufficient alignment capability learned from the data, hindering the model’s knowledge acquisition. Conversely, overly lenient constraints may introduce excessive invalid sample pairs, causing the model to learn irrelevant or even counterproductive knowledge.

Table 11. Performance (accuracy %) with different similarity thresholds in DCL. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. For the proposed DCL, the inter-sample consistency alignment controls the matching of positive and negative pairs through upper-bound

δ^{u p p e r}

and lower-bound

δ^{l o w e r}

. From the result, we observe the combination of

0.7

and

0.3

is the optimal values.

Impact of the weights in the loss function. Equation (9) represents the loss function of CoMT-VD. Since labeled data typically provides more reliable optimization for models in supervised settings, we set

λ_{1} = 1.0

. This experiment aims to explore the effects of varying

λ_{2}

and

λ_{3}

on model performance and identify the optimal combination of

λ_{2}

and

λ_{3}

for CoMT-VD. We integrate CoMT-VD into the pretrained SCTF model and trained it on RLVS, RWF-2000, and their combined dataset (RLVS + RWF-2000). By fine-tuning the model with different combinations and evaluating accuracy on the Hockey Fight and Movies datasets, we select the best parameter configuration via observing the accuracy trend with different

λ_{2}

and

λ_{3}

. Figure 6 demonstrate that the model achieves optimal performance when

λ_{2} = 0.5

and

λ_{3} = 0.3

.

Figure 6. Influence of different parameter combinations with loss function for the model. We fix

λ_{1}

at

1.0

and explore the most suitable combination of

λ_{2}

and

λ_{3}

by varying their values within the range

[0, 1]

with a step size of

0.1

. Using the pre-trained SCTF model fine-tuned with CoMT-VD on RLVS (a), RWF-2000 (b), and their mixed dataset (c), and test different weight combinations on Hockey Fight. For clearer visualization, the figure shows the Gaussian-smoothed accuracy variations of fine-tuned models across different weight configurations. The trend reveals that the model achieves peak performance when

λ_{2} = 0.5

and

λ_{3} = 0.3

.

Performance of other augmentation: In this ablation study, we evaluate classical image augmentation techniques, rotation and cropping, finding they slightly reduce model performance, as summarized in Table 12. For rotation, we implement

30^{\circ}

(strong) and

15^{\circ}

(weak) angles. For cropping, randomly selected dimensions are used with ratios

β = 0.6

(strong) and

β = 0.8

(weak). The performance decrease can be attributed to the inherent translation invariance of CNNs, where rotation fails to alter feature relationships adequately. Furthermore, unlike occlusion, cropping inadvertently obscures critical continuous features in sequential data, leading to inferior performance. This suggests that inappropriate augmentation causes models to learn ineffective features, indicating an upper bound to the knowledge gain from augmented data.

Table 12. Validation of the performance (accuracy %) of other augmentations. The numbers in parentheses (red color) represent the absolute performance degradation (percentage) loss by CoMT-VD compared to the best performance sets.

6. Conclusions

6.1. Discussion

In this paper, we address the critical challenge of domain shift in automated violence detection systems, which notably impedes their reliable deployment in diverse real-world surveillance scenarios. We introduce CoMT-VD, a new Contrastive Mean Teacher-based Violence Detection model, specifically engineered to enhance model adaptability and robustness across unseen target domains. The efficacy of CoMT-VD stems from its innovative integration of a Mean Teacher architecture, which adequately leverages unlabeled target domain data to foster stable, domain-invariant feature representations through consistency regularization. Crucially, to counteract the potential supervisory noise from pseudo-labels and further refine the learned feature space, we incorporate a sophisticated dual-strategy contrastive learning module. It meticulously refines features by enforcing both intra-sample consistency for robust representation learning and inter-sample consistency for sharpened categorical discriminability. Broad evaluations on five benchmark datasets unequivocally demonstrate the superior capacity to substantially improve generalization performance on unseen data and across challenging cross-dataset assessments, marking a notable advancement towards more robust and reliable real-world violence detection systems.

6.2. Limitation

This paper verifies the proposed training strategy for enhancing the model’s ability to handle domain shift problems. Due to the limitation of difficulty in collecting violent behavior datasets in specific scenarios, this study employs the addition of various types of noise to general open source datasets for violent behavior detection, simulating common real-world conditions such as low-light, rainy, foggy, and occluded scenes (as illustrated in Figure 4). However, whether the proposed training strategy can maintain consistent performance under corresponding real-world environmental changes requires further validation in subsequent research using datasets from actual specific scenarios. Additionally, this paper focuses on fine-tuning pre-trained models to improve model performance, but it has limitations in detecting continuous video sequences with fewer than 16 frames, necessitating special processing techniques (such as padding video frames or implementing looped playback). Furthermore, according to the ablation experiments, the model has limitations in detecting videos where the number of frames containing violent behavior is extremely low.

6.3. Future Work

For future work, we plan to explore several promising directions, including the integration of more advanced spatiotemporal backbone networks to further enhance feature extraction capabilities, investigating adaptive weighting mechanisms for dual consistency losses or other self-supervised learning forms tailored for adversarial domain shifts, extending CoMT-VD to a multi-modal framework incorporating audio cues for richer contextual understanding, and optimizing CoMT-VD for real-time inference and deployment on edge devices to address computational efficiency.

Author Contributions

Conceptualization, Z.X.; Methodology, Z.X., Z.T. and B.Z.; Validation, Z.X.; Formal analysis, Z.X.; Data curation, Z.X.; Writing—original draft, Z.X.; Writing—review & editing, Z.T. and B.Z.; Funding acquisition, Z.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 61772125; the Fundamental Research Funds for the Central Universities, China under Grants No. N2317004.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this article are all baseline public datasets for violence detection. The sources of each public dataset are introduced in the experimental section.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Clarin, C.; Dionisio, J.; Echavez, M.; Naval, P. DOVE: Detection of movie violence using motion intensity analysis on skin and blood. PCSC 2005, 6, 150–156. [Google Scholar]
De Souza, F.D.; Chavez, G.C.; do Valle Jr, E.A.; Araújo, A.d.A. Violence detection in video using spatio-temporal features. In Proceedings of the 2010 23rd SIBGRAPI Conference on Graphics, Patterns and Images, Gramado, Brazil, 30 August–3 September 2010; pp. 224–230. [Google Scholar]
Khan, H.; Yuan, X.; Qingge, L.; Roy, K. Violence Detection from Industrial Surveillance Videos Using Deep Learning. IEEE Access 2025, 13, 15363–15375. [Google Scholar] [CrossRef]
Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6479–6488. [Google Scholar]
Maqsood, R.; Bajwa, U.I.; Saleem, G.; Raza, R.H.; Anwar, M.W. Anomaly recognition from surveillance videos using 3D convolution neural network. Multimed. Tools Appl. 2021, 80, 18693–18716. [Google Scholar] [CrossRef]
Wang, Z.; She, Q.; Smolic, A. Action-net: Multipath excitation for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13214–13223. [Google Scholar]
Soliman, M.M.; Kamal, M.H.; Nashed, M.A.E.M.; Mostafa, Y.M.; Chawky, B.S.; Khattab, D. Violence recognition from videos using deep learning techniques. In Proceedings of the 2019 Ninth International Conference on Intelligent Computing and Information Systems (ICICIS), Cairo, Egypt, 8–10 December 2019; pp. 80–85. [Google Scholar]
Pandey, B.; Sinha, U.; Nagwanshi, K.K. A multi-stream framework using spatial–temporal collaboration learning networks for violence and non-violence classification in complex video environments. Int. J. Mach. Learn. Cybern. 2025, 16, 4737–4766. [Google Scholar] [CrossRef]
Sun, S.; Gong, X. Multi-scale bottleneck transformer for weakly supervised multimodal violence detection. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
Rendón-Segador, F.J.; Álvarez-García, J.A.; Salazar-González, J.L.; Tommasi, T. Crimenet: Neural structured learning using vision transformer for violence detection. Neural Netw. 2023, 161, 318–329. [Google Scholar] [CrossRef]
Pan, S.J.; Tsang, I.W.; Kwok, J.T.; Yang, Q. Domain adaptation via transfer component analysis. IEEE Trans. Neural Netw. 2010, 22, 199–210. [Google Scholar] [CrossRef]
Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Xu, Y.; Cao, H.; Mao, K.; Chen, Z.; Xie, L.; Yang, J. Aligning correlation information for domain adaptation in action recognition. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 6767–6778. [Google Scholar] [CrossRef]
Da Costa, V.G.T.; Zara, G.; Rota, P.; Oliveira-Santos, T.; Sebe, N.; Murino, V.; Ricci, E. Dual-head contrastive domain adaptation for video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 1181–1190. [Google Scholar]
Li, J.; Xu, R.; Liu, X.; Ma, J.; Li, B.; Zou, Q.; Ma, J.; Yu, H. Domain adaptation based object detection for autonomous driving in foggy and rainy weather. arXiv 2023, arXiv:2307.09676. [Google Scholar] [CrossRef]
Dasgupta, A.; Jawahar, C.; Alahari, K. Source-free video domain adaptation by learning from noisy labels. Pattern Recognit. 2025, 161, 111328. [Google Scholar] [CrossRef]
Xu, Y.; Cao, H.; Xie, L.; Li, X.L.; Chen, Z.; Yang, J. Video unsupervised domain adaptation with deep learning: A comprehensive survey. ACM Comput. Surv. 2024, 56, 1–36. [Google Scholar] [CrossRef]
Gao, Z.; Zhao, Y.; Zhang, H.; Chen, D.; Liu, A.A.; Chen, S. A novel multiple-view adversarial learning network for unsupervised domain adaptation action recognition. IEEE Trans. Cybern. 2021, 52, 13197–13211. [Google Scholar] [CrossRef]
Huang, S.W.; Lin, C.T.; Chen, S.P.; Wu, Y.Y.; Hsu, P.H.; Lai, S.H. Auggan: Cross domain adaptation with gan-based data augmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 718–731. [Google Scholar]
Choi, J.; Sharma, G.; Chandraker, M.; Huang, J.B. Unsupervised and semi-supervised domain adaptation for action recognition from drones. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1717–1726. [Google Scholar]
Cen, F.; Zhao, X.; Li, W.; Wang, G. Deep feature augmentation for occluded image classification. Pattern Recognit. 2021, 111, 107737. [Google Scholar] [CrossRef]
Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3024–3033. [Google Scholar]
Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
Guo, Y.; Ma, S.; Su, H.; Wang, Z.; Zhao, Y.; Zou, W.; Sun, S.; Zheng, Y. Dual mean-teacher: An unbiased semi-supervised framework for audio-visual source localization. Adv. Neural Inf. Process. Syst. 2023, 36, 48639–48661. [Google Scholar]
Mahmoodi, J.; Nezamabadi-pour, H. A spatio-temporal model for violence detection based on spatial and temporal attention modules and 2D CNNs. Pattern Anal. Appl. 2024, 27, 46. [Google Scholar] [CrossRef]
Mahmoodi, J.; Nezamabadi-Pour, H. Violence Detection in Video Using Statistical Features of the Optical Flow and 2D Convolutional Neural Network. Comput. Intell. 2025, 41, e70034. [Google Scholar] [CrossRef]
Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 535–541. [Google Scholar]
Wu, M.C.; Chiu, C.T.; Wu, K.H. Multi-teacher knowledge distillation for compressed video action recognition on deep neural networks. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2202–2206. [Google Scholar]
Kumar, A.; Mitra, S.; Rawat, Y.S. Stable mean teacher for semi-supervised video action detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 4419–4427. [Google Scholar]
Wang, X.; Hu, J.F.; Lai, J.H.; Zhang, J.; Zheng, W.S. Progressive teacher-student learning for early action prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3556–3565. [Google Scholar]
Xiong, B.; Yang, X.; Song, Y.; Wang, Y.; Xu, C. Modality-Collaborative Test-Time Adaptation for Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 26732–26741. [Google Scholar]
He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
Singh, A.; Chakraborty, O.; Varshney, A.; Panda, R.; Feris, R.; Saenko, K.; Das, A. Semi-supervised action recognition with temporal contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10389–10399. [Google Scholar]
Shah, K.; Shah, A.; Lau, C.P.; de Melo, C.M.; Chellappa, R. Multi-view action recognition using contrastive learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 3381–3391. [Google Scholar]
Lorre, G.; Rabarisoa, J.; Orcesi, A.; Ainouz, S.; Canu, S. Temporal contrastive pretraining for video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 662–670. [Google Scholar]
Zheng, S.; Chen, S.; Jin, Q. Few-shot action recognition with hierarchical matching and contrastive learning. In Computer Vision—ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 297–313. [Google Scholar]
Nguyen, T.T.; Bin, Y.; Wu, X.; Hu, Z.; Nguyen, C.D.T.; Ng, S.K.; Luu, A.T. Multi-scale contrastive learning for video temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6227–6235. [Google Scholar]
Dave, I.; Gupta, R.; Rizve, M.N.; Shah, M. Tclr: Temporal contrastive learning for video representation. Comput. Vis. Image Underst. 2022, 219, 103406. [Google Scholar] [CrossRef]
Altabrawee, H.; Noor, M.H.M. STCLR: Sparse Temporal Contrastive Learning for Video Representation. Neurocomputing 2025, 630, 129694. [Google Scholar] [CrossRef]
Sohn, K.; Liu, S.; Zhong, G.; Yu, X.; Yang, M.H.; Chandraker, M. Unsupervised domain adaptation for face recognition in unlabeled videos. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3210–3218. [Google Scholar]
Kim, D.; Tsai, Y.H.; Zhuang, B.; Yu, X.; Sclaroff, S.; Saenko, K.; Chandraker, M. Learning cross-modal contrastive features for video domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 13618–13627. [Google Scholar]
Chen, M.H.; Kira, Z.; AlRegib, G.; Yoo, J.; Chen, R.; Zheng, J. Temporal attentive alignment for large-scale video domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6321–6330. [Google Scholar]
Aich, A.; Peng, K.C.; Roy-Chowdhury, A.K. Cross-domain video anomaly detection without target domain adaptation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2579–2591. [Google Scholar]
Valois, P.H.V.; Niinuma, K.; Fukui, K. Occlusion Sensitivity Analysis With Augmentation Subspace Perturbation in Deep Feature Space. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 1–6 January 2024; pp. 4829–4838. [Google Scholar]
Wang, Z.; Jiang, J.x.; Zeng, S.; Zhou, L.; Li, Y.; Wang, Z. Multi-receptive field feature disentanglement with Distance-Aware Gaussian Brightness Augmentation for single-source domain generalization in medical image segmentation. Neurocomputing 2025, 638, 130120. [Google Scholar] [CrossRef]
Kandel, I.; Castelli, M.; Manzoni, L. Brightness as an augmentation technique for image classification. Emerg. Sci. J. 2022, 6, 881–892. [Google Scholar] [CrossRef]
Choi, J.; Sharma, G.; Schulter, S.; Huang, J.B. Shuffle and attend: Video domain adaptation. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII 16; Springer: Cham, Switzerland, 2020; pp. 678–695. [Google Scholar]
Sahoo, A.; Shah, R.; Panda, R.; Saenko, K.; Das, A. Contrast and mix: Temporal contrastive video domain adaptation with background mixing. Adv. Neural Inf. Process. Syst. 2021, 34, 23386–23400. [Google Scholar]
Yuan, J.; Liu, Y.; Shen, C.; Wang, Z.; Li, H. A simple baseline for semi-supervised semantic segmentation with strong data augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 8229–8238. [Google Scholar]
Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Proceedings of the Advances in Neural Information Processing Systems 29 (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
Cheng, M.; Cai, K.; Li, M. RWF-2000: An open large scale video database for violence detection. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4183–4190. [Google Scholar]
Bermejo Nievas, E.; Deniz Suarez, O.; Bueno García, G.; Sukthankar, R. Violence detection in video using computer vision techniques. In Computer Analysis of Images and Patterns, Proceedings of the 14th International Conference, CAIP 2011, Seville, Spain, 29–31 August 2011; Proceedings, Part II 14; Springer: Cham, Switzerland, 2011; pp. 332–339. [Google Scholar]
Hassner, T.; Itcher, Y.; Kliper-Gross, O. Violent flows: Real-time detection of violent crowd behavior. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 1–6. [Google Scholar]
Song, W.; Zhang, D.; Zhao, X.; Yu, J.; Zheng, R.; Wang, A. A novel violent video detection scheme based on modified 3D convolutional neural networks. IEEE Access 2019, 7, 39172–39179. [Google Scholar] [CrossRef]
Tan, Z.; Xia, Z.; Wang, P.; Wu, D.; Li, L. SCTF: An efficient neural network based on local spatial compression and full temporal fusion for video violence detection. Multimed. Tools Appl. 2024, 83, 36899–36919. [Google Scholar]
Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 6836–6846. [Google Scholar]

Figure 1. Examples of domain shift in violent detection. (a) Violent behavior with occlusion in real-world scenarios. (b) Violent behavior with low light in real-world scenarios. (c) Pure-color background. (d) High motion blur.

Figure 2. Overview of the proposed CoMT-VD framework. To enable the model to tackle the domain shift problem, we fine-tuned the pre-trained model using the Mean Teacher framework. At the same time, to ensure that the student network can more effectively leverage the knowledge learned by the teacher network, we introduced a contrastive learning module during the fine-tuning process. Furthermore, to enhance the reliability of the knowledge acquired by the contrastive learning module, we designed a novel DCL contrastive learning approach. The blue line in the figure represents the data flow direction of the self-supervised branch, and the red part shows the data flow direction of the supervised branch.

Figure 3. Illustration of dual-strategy contrastive learning (DCL). To align samples from different distributions, we propose a new contrastive learning mechanism with two key components: Intra-Sample Consistency and Inter-Sample Consistency.

Figure 4. Instances of domain shift under simulated real-world conditions: Occlusions, Rainy Scene, Foggy Scene, and low-illumination environments. These four examples are examples where the baseline model tends to yield detection errors. By combining CoMT VD, all models will obtain more accurate detection results.

Figure 5. ROC curves of C3D and SCTF on cross-dataset voilence detection. The sub-figure (a–c) show the changes in ROC curves of C3D and SCTF models tested on the Hockey Fight dataset before and after incorporating the proposed CoMT-VD framework. The results demonstrate that with CoMT-VD, the AUC values are improved for all models.

Figure 6. Influence of different parameter combinations with loss function for the model. We fix

λ_{1}

at

1.0

and explore the most suitable combination of

λ_{2}

and

λ_{3}

by varying their values within the range

[0, 1]

with a step size of

0.1

. Using the pre-trained SCTF model fine-tuned with CoMT-VD on RLVS (a), RWF-2000 (b), and their mixed dataset (c), and test different weight combinations on Hockey Fight. For clearer visualization, the figure shows the Gaussian-smoothed accuracy variations of fine-tuned models across different weight configurations. The trend reveals that the model achieves peak performance when

λ_{2} = 0.5

and

λ_{3} = 0.3

.

Table 1. Volence detection performance (accuracy %) across five dataset under different domain shift scenarios: rainy, foggy, low-light, and occlusion scenes. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models.

Methods	VGG16 + LSTM []	VGG16 + LSTM + CoMT-VD	C3D []	C3D + CoMT-VD	ViViT []	ViViT + CoMT-VD	ActionNet []	ActionNet + CoMT-VD	SCTF []	SCTF + CoMT-VD
ACC
Datesets
	Rainny Scenes
RLVS []	71.5	78.3 (+6.8)	70.3	78.5 (+8.2)	69.5	79.3 (+9.8)	73.3	80.3 (+7.0)	74.0	82.8 (+8.8)
RWF-2000 []	69.5	78.8 (+9.3)	70.8	79.3 (+8.5)	68.8	76.3 (+7.5)	70.0	79.5 (+9.5)	73.5	79.8 (+6.3)
Hockey Fight []	72.3	83.5 (+9.2)	71.5	82.3 (+10.8)	70.5	78.0 (+7.5)	73.5	82.3 (+8.8)	76.3	85.0 (+9.7)
Movies []	75.0	80.0 (+5.0)	72.5	80.0 (+7.5)	72.5	77.5 (+5.0)	75.0	80.0 (+5.0)	77.5	85.0 (+7.5)
Violent Flow []	72.0	84.0 (+12.0)	76.0	82.0 (+6.0)	70.0	78.0 (+8.0)	76.0	86.0 (+10.0)	78.0	84.0 (+6.0)
	Foggy Scenes
RLVS []	71.8	77.5 (+5.7)	69.3	76.3 (+7.0)	69.8	75.0 (+5.2)	74.5	84.5 (+10.0)	73.0	83.3 (+10.3)
RWF-2000 []	67.3	78.5 (+11.2)	69.8	78.3 (+8.5)	69.5	75.8 (+6.3)	72.3	80.3 (+8.0)	73.8	81.5 (+7.8)
Hockey Fight []	69.5	78.0 (+8.5)	71.0	77.5 (+6.5)	68.0	77.0 (+9.0)	73.0	82.0 (+9.0)	75.5	83.5 (+8.0)
Movies []	72.5	77.5 (+5.0)	69.0	80.0 (+11.0)	68.5	77.5 (+9.0)	75.0	85.0 (+10.0)	75.0	82.5 (+7.5)
Violent Flow []	70.0	76.0 (+6.0)	72.0	80.0 (+8.0)	68.0	78.0 (+10.0)	78.0	84.0 (+6.0)	78.0	86.0 (+8.0)
	Low-light Scenes
RLVS []	70.0	75.5 (+5.5)	69.3	77.0 (+7.7)	67.5	76.8 (+9.3)	70.3	81.3 (+11.0)	72.0	81.8 (+9.8)
RWF-2000 []	67.0	79.3 (+12.3)	68.5	78.8 (+9.3)	66.5	75.3 (+8.8)	69.0	79.0 (+10.0)	71.3	79.3 (+8.0)
Hockey Fight []	69.0	78.5 (+9.5)	68.0	78.0 (+10.0)	67.5	77.0 (+9.5)	71.5	80.5 (+9.0)	75.5	82.0 (+6.5)
Movies []	70.0	77.5 (+7.5)	70.0	77.5 (+7.5)	70.0	75.0 (+5.0)	70.0	80.0 (+10.0)	75.0	82.0 (+7.0)
Violent Flow []	68.0	78.0 (+10.0)	72.0	80.0 (+10.0)	70.0	76.0 (+6.0)	76.0	86.0 (+10.0)	76.0	84.0 (+8.0)
	Occlusion Scenes
RLVS []	68.8	76.8 (+8.0)	70.3	78.8 (+8.5)	67.5	77.3 (+9.8)	71.3	82.5 (+11.2)	72.0	84.3 (+12.3)
RWF-2000 []	67.3	78.5 (+11.2)	68.3	79.0 (+10.7)	66.8	77.0 (+11.2)	72.0	80.3 (+10.3)	71.8	81.3 (+9.5)
Hockey Fight []	68.5	79.0 (+10.5)	69.0	80.5 (+11.5)	67.5	77.5 (+10.0)	72.5	81.0 (+8.5)	75.5	82.5 (+7.0)
Movies []	70.0	77.5 (+7.5)	67.5	80.0 (+12.5)	67.5	77.5 (+10.0)	72.5	80.0 (+7.5)	72.5	82.5 (+10.0)
Violent Flow []	66.0	78.0 (+12.0)	68.0	76.0 (+8.0)	66.0	76.0 (+10.0)	78.0	84.0 (+6.0)	76.0	86.0 (+10.0)

Table 2. Performance (accuracy %) of models trained on real-world datasets test on the Hockey Fight dataset under Setup-1. We employ real-world datasets (RLVS and RWF-2000, both of them come from surveillance videos, which include indoor/outdoor fights and daily violent behaviors) as training data and evaluated on the Hockey Fight dataset to validate the model’s generalization capability for cross-domain detection. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models.

Methods	VGG16 + LSTM	VGG16 + LSTM + CoMT-VD	C3D	C3D + CoMT-VD	ViViT	ViViT + CoMT-VD	ActionNet	ActionNet + CoMT-VD	SCTF	SCTF + CoMT-VD
ACC
Datesets
RLVS	67.4	75.2 (+7.8)	69.3	75.1 (+5.8)	68.7	74.8 (+6.1)	72.9	78.6 (+5.7)	74.1	79.5 (+5.4)
RWF-2000	68.3	75.8 (+7.5)	70.0	76.7 (+6.7)	69.4	78.6 (+9.2)	74.9	80.0 (+5.1)	73.5	81.7 (+8.2)
RLVS + RWF-2000	71.3	78.7 (+7.4)	72.3	79.5 (+7.2)	71.3	78.9 (+7.6)	75.8	82.9 (+7.1)	75.0	82.3 (+7.3)

Table 3. Performance of models trained on real-world datasets test on the Movies dataset under Setup-2. We employ real-world datasets (RLVS and RWF-2000, both of them come from surveillance videos, which include indoor/outdoor fights and daily violent behaviors) as training data and evaluated on the Movies dataset to validate the model’s generalization capability for cross-domain detection. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models.

Methods	VGG16 + LSTM	VGG16 + LSTM + CoMT-VD	C3D	C3D + CoMT-VD	ViViT	ViViT + CoMT-VD	ActionNet	ActionNet + CoMT-VD	SCTF	SCTF + CoMT-VD
ACC
Datesets
RLVS	72.5	78.0 (+5.5)	71.5	76.5 (+5.0)	69.5	76.0 (+6.5)	76.0	82.0 (+6.0)	74.5	82.0 (+7.5)
RWF-2000	69.0	77.5 (+8.5)	72.0	79.5 (+7.5)	68.0	74.0 (+6.0)	74.0	83.5 (+9.5)	76.0	84.0 (+8.0)
RLVS + RWF-2000	72.0	80.0 (+8.00)	70.0	82.0 (+12.0)	72.0	78.0 (+6.0)	78.0	86.5 (+8.5)	76.5	86.0 (+9.5)

Table 4. Performance (accuracy %) with different brightness augmentations. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. The first column specifies different value combinations of illumination

γ_{s}

(strong augmentation) and

γ_{w}

(weak augmentation) in the illumination-reduction augmentation.

Table 4. Performance (accuracy %) with different brightness augmentations. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. The first column specifies different value combinations of illumination

γ_{s}

(strong augmentation) and

γ_{w}

(weak augmentation) in the illumination-reduction augmentation.

Methods		RLVS []	RWF-2000 []	Hockey Fight []	Movies []	Violent Flow []
ACC
$γ_{s}$ $γ_{w}$
[0.5, 1.5]	[0.9–1.1]	81.8	79.3	82.3	81.5	84.0
	[0.9–1.2]	79.0 (−2.8)	80.1 (−2.2)	79.6 (−1.9)	78.8 (−2.7)	80.2 (−3.8)
	[1.0–1.1]	78.4 (−3.4)	79.2 (−3.1)	78.1 (−4.2)	77.8 (−3.7)	78.8 (−5.2)
	[0.8–1.0]	79.5 (−2.3)	76.3 (−3.0)	78.5 (−3.8)	77.0 (−4.5)	78.00 (−6.0)
[0.5, 1.4]	[0.9–1.1]	77.2 (−4.6)	76.4 (−2.9)	78.0 (−4.3)	76.5 (−5.0)	79.3 (−4.7)
[0.6, 1.5]		76.8 (−5.0)	75.0 (−4.3)	79.6 (−1.9)	77.8 (−3.7)	78.2 (−5.8)
[0.4, 1.5]		78.0 (−3.8)	78.8 (−2.7)	79.3 (−2.2)	78.2 (−3.3)	80.1 (−3.9)

Table 5. Performance (accuracy %) with different occlussion augmentation rates in the strong and weak augmenter. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. This experiment investigates the impact of occlusion size on model generalization capability. The numbers of pixels with occlusion are calculated by

⌊ m_{τ_{s}} ⊙ x ⌋ ⊙ ⌊ m_{τ_{s}} ⊙ x ⌋

,

\begin{matrix} b r e a k \end{matrix}

⌊ m_{τ_{w}} ⊙ x ⌋ ⊙ ⌊ m_{τ_{w}} ⊙ x ⌋

respectively, where

⌊ \cdot ⌋

represents rounding down to an integer.

Table 5. Performance (accuracy %) with different occlussion augmentation rates in the strong and weak augmenter. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. This experiment investigates the impact of occlusion size on model generalization capability. The numbers of pixels with occlusion are calculated by

⌊ m_{τ_{s}} ⊙ x ⌋ ⊙ ⌊ m_{τ_{s}} ⊙ x ⌋

,

\begin{matrix} b r e a k \end{matrix}

⌊ m_{τ_{w}} ⊙ x ⌋ ⊙ ⌊ m_{τ_{w}} ⊙ x ⌋

respectively, where

⌊ \cdot ⌋

represents rounding down to an integer.

Methods		RLVS []	RWF-2000 []	Hockey Fight []	Movies []	Violent Flow []
ACC
$τ_{w}$ $τ_{s}$
0.3	0.1	81.8	79.3	82.3	81.5	84.0
0.3	0.2	77.8 (−4.0)	76.3 (−3.0)	79.8 (−2.5)	78.5 (−3.0)	79.7 (−4.3)
0.2	0.1	76.2 (−5.6)	75.0 (−4.3)	77.5 (−4.8)	76.3 (−5.2)	79.1 (−5.9)
0.4	0.1	78.3 (−3.5)	77.0 (−5.3)	78.6 (−3.7)	78.3 (−3.2)	80.0 (−4.0)

Table 6. Performance (%) with different frame blending rate. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. The number of blended frames is calculated as

⌊ λ_{n v} \times 16 ⌋

, strengthening temporal feature extraction capabilities.

Table 6. Performance (%) with different frame blending rate. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. The number of blended frames is calculated as

⌊ λ_{n v} \times 16 ⌋

, strengthening temporal feature extraction capabilities.

Methods		RLVS []	RWF-2000 []	Hockey Fight []	Movies []	Violent Flow []
ACC
$λ_{s}$ $λ_{w}$
0.4	0.1	81.8	79.3	82.3	83.5	84.0
0.4	0.2	79.3 (−2.5)	76.3 (−3.0)	78.6 (−3.7)	80.6 (−2.9)	80.2 (−3.8)
0.3	0.1	78.5 (−3.3)	78.2 (−1.1)	77.9 (−4.4)	80.6 (−2.7)	79.3 (−4.7)
0.5	0.1	77.3 (−4.5)	74.5 (−3.8)	77.2 (−5.1)	79.2 (−4.3)	79.8 (−4.2)

Table 7. Impact of accuracy by Mean Teacher and DCL for cross-dataset detection. To verify the contributions of both modules, we conduct experiments by selectively disabling each component through setting their corresponding loss terms to zero, thereby validating the effectiveness of both components in CoMT-VD. These experiments are performed by training on three datasets (RLVS, RWF-2000, and RLVS + RWF-2000) and testing on the Hockey Fight dataset, aiming to demonstrate CoMT-VD’s capability in addressing domain shift challenges for cross-dataset detection tasks. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models. × indicates that the corresponding module has been included, while √ indicates that the corresponding module has not been included.

Methods			Dataset
Baseline	Mean Teacher	DCL	RLVS []	RWF-2000 []	RLVS + RWF-2000
C3D	×	×	69.3	70.0	72.3
	√	×	73.3 (+4.0)	75.5 (+5.50)	76.0 (+3.7)
	×	√	71.4 (+2.1)	74.5 (+2.1)	74.25 (+1.75)
	√	√	75.1 (+5.8)	76.7 (+6.7)	79.5 (+7.2)
SCTF	×	×	74.1	73.5	75.0
	√	×	76.3 (+2.2)	78.0 (+4.5)	78.3 (+3.3)
	×	√	76.2 (+2.1)	76.5 (+6.0)	77.5 (+2.5)
	√	√	79.5 (+5.4)	81.7 (+8.2)	82.3 (+7.3)

Table 8. Impact of F1 score by Mean Teacher and DCL for cross-dataset detection. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models. × indicates that the corresponding module has been included, while √ indicates that the corresponding module has not been included.

Methods			Dataset
Baseline	Mean Teacher	DCL	RLVS []	RWF-2000 []	RLVS + RWF-2000
C3D	×	×	0.709	0.715	0.713
	√	×	0.728 (+0.019)	0.749 (+0.034)	0.742 (+0.029)
	×	√	0.716 (+0.007)	0.734 (+0.021)	0.729 (+0.013)
	√	√	0.748 (+0.039)	0.756 (+0.041)	0.757 (+0.044)
SCTF	×	×	0.712	0.724	0.728
	√	×	0.731 (+0.019)	0.732 (+0.008)	0.738 (+0.010)
	×	√	0.728 (+0.016)	0.729 (+0.005)	0.736 (+0.008)
	√	√	0.769 (+0.057)	0.773 (+0.049)	0.786 (+0.058)

Table 9. Validation of the Impact of accuracy by Mean Teacher and DCL on the Models for unseen scenarios dataset violent video detection. We conduct experiments by selectively disabling each component through setting their corresponding loss terms to zero, thereby validating the effectiveness of both components in CoMT-VD. These experiments are performed by training on five benchmark datasets (RLVS, RWF-2000, Hockey Fight, Movies and Violent) respectively, and test the performance on each simulated scenarios of test set to demonstrate CoMT-VD’s capability in addressing domain shift challenges for unseen conditions dataset detection tasks. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models. × indicates that the corresponding module has been included, while √ indicates that the corresponding module has not been included.

Methods			Dataset
Baseline	Mean Teacher	DCL	RLVS []	RWF-2000 []	Hockey Fight []	Movies []	Violent Flow []
C3D	×	×	70.3	69.4	70.6	71.6	72.0
	√	×	74.0 (+3.7)	75.3 (+5.9)	76.0 (+5.4)	78.5 (+7.5)	76.0 (+3.3)
	×	√	72.5 (+2.2)	73.8 (+4.4)	74.4 (+3.8)	72.5 (+4.5)	70.0 (+4.0)
	√	√	77.3 (+7.0)	78.8 (+10.3)	79.5 (+11.0)	79.0 (+11.0)	78.0 (+12.0)
SCTF	×	×	72.6	72.8	75.5	75.7	76.2
	√	×	79.5 (+6.9)	78.0 (+5.2)	80.0 (+4.5)	79.5 (+3.8)	82.60 (+6.4)
	×	√	75.5 (+2.9)	74.6 (+1.8)	78.3 (+2.8)	77.3 (+1.6)	79.3 (+3.1)
	√	√	83.8 (+11.2)	80.2 (+7.4)	82.7 (+6.3)	81.5 (+5.8)	85.7 (+9.5)

Table 10. Validation of the Impact of F1 score by Mean Teacher and DCL on the Models for unseen scenarios dataset violent video detection. The numbers in parentheses (blue color) indicate the absolute performance gains (percentage points) achieved by CoMT-VD over vanilla baseline models. × indicates that the corresponding module has been included, while √ indicates that the corresponding module has not been included.

Methods			Dataset
Baseline	Mean Teacher	DCL	RLVS []	RWF-2000 []	Hockey Fight []	Movies []	Violent Flow []
C3D	×	×	0.712	0.721	0.723	0.719	0.726
	√	×	0.726 (+0.014)	0.737 (+0.016)	0.742 (+0.019)	0.751 (+0.032)	0.748 (+0.022)
	×	√	0.719 (+0.007)	0.729 (+0.008)	0.738 (+0.015)	0.727 (+0.008)	0.740 (+0.014)
	√	√	0.732 (+0.019)	0.743 (+0.022)	0.748 (+0.025)	0.756 (+0.037)	0.751 (+0.025)
SCTF	×	×	0.723	0.730	0.739	0.732	0.741
	√	×	0.743 (+0.020)	0.755 (+0.025)	0.762 (+0.023)	0.786 (+0.054)	0.791 (+0.050)
	×	√	0.736 (+0.013)	0.742 (+0.012)	0.769 (+0.030)	0.751 (+0.019)	0.773 (+0.032)
	√	√	0.802 (+0.079)	0.781 (+0.051)	0.793 (+0.054)	0.810 (+0.078)	0.797 (+0.056)

Table 11. Performance (accuracy %) with different similarity thresholds in DCL. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. For the proposed DCL, the inter-sample consistency alignment controls the matching of positive and negative pairs through upper-bound

δ^{u p p e r}

and lower-bound

δ^{l o w e r}

. From the result, we observe the combination of

0.7

and

0.3

is the optimal values.

Table 11. Performance (accuracy %) with different similarity thresholds in DCL. The numbers of red color represent the absolute performance degradation (percentage) lossed by CoMT-VD compared to the best performance. For the proposed DCL, the inter-sample consistency alignment controls the matching of positive and negative pairs through upper-bound

δ^{u p p e r}

and lower-bound

δ^{l o w e r}

. From the result, we observe the combination of

0.7

and

0.3

is the optimal values.

Methods		RLVS []	RWF-2000 []	Hockey Fight []	Movies []	Violent Flow []
ACC
$δ^{upper}$ $δ^{lower}$
0.7	0.3	81.8	79.3	82.3	81.5	84.0
	0.4	79.0 (−2.8)	78.5 (−2.8)	79.6 (−2.7)	79.3 (−2.2)	82.6 (−1.4)
	0.2	78.9 (−2.9)	79.0 (−0.3)	79.8 (−2.5)	78.9 (−2.6)	80.9 (−3.1)
0.6	0.3	79.8 (−2.0)	78.2 (−1.1)	79.7 (−1.6)	79.2 (−2.3)	82.2 (−1.8)
0.8	0.3	78.3 (−3.5)	77.3 (−2.6)	80.2 (−2.1)	80.4 (−1.1)	81.2 (−2.8)

Table 12. Validation of the performance (accuracy %) of other augmentations. The numbers in parentheses (red color) represent the absolute performance degradation (percentage) loss by CoMT-VD compared to the best performance sets.

Augmentation Methods			Dataset
Augmenter in CoMT-VD	Rotarion	Cropping	RLVS []	RWF-2000 []	Hockey Fight []	Movies []	Violent Flow []
Occlusion Brightness Blending	×	×	83.8	80.2	82.7	81.5	85.7
	×	√	81.6 (−2.2)	79.3 (−0.9)	81.5 (−1.2)	79.0 (−2.5)	83.6 (−2.1)
	√	×	80.3 (−3.5)	79.1 (−1.1)	80.9 (−1.8)	81.8 (−1.9)	84.5 (−1.2)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Overcoming Domain Shift in Violence Detection with Contrastive Consistency Learning

Abstract

1. Introduction

3. Proposed Method

3.1. Problem Definition

3.2. Contrastive Mean Teacher Violence Detection

3.2.1. Method Overview

3.2.2. Teacher–Student Framework

3.2.3. Cross-Domain Augmentation

3.2.4. Dual-Strategy Contrastive Learning

3.2.5. Mean Teacher Optimization

4. Experiments

4.1. Benchmark Datasets

4.2. Baselines

4.3. Experimental Setup

4.4. Experimental Results

4.4.1. Performance for Unseen Conditions

4.4.2. Performance for Cross Datasets

5. Ablation Studies

5.1. Different Augmentation Intensities

5.2. The Impact of Other Factors on Performance

6. Conclusions

6.1. Discussion

6.2. Limitation

6.3. Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics

Overcoming Domain Shift in Violence Detection with Contrastive Consistency Learning

Abstract

1. Introduction

2. Related Work

2.1. Violence Detection

2.2. Knowledge Distillation and Mean Teacher

2.3. Contrastive Learning (CL)

2.4. Domain Adaptation (DA)

3. Proposed Method

3.1. Problem Definition

3.2. Contrastive Mean Teacher Violence Detection

3.2.1. Method Overview

3.2.2. Teacher–Student Framework

3.2.3. Cross-Domain Augmentation

3.2.4. Dual-Strategy Contrastive Learning

3.2.5. Mean Teacher Optimization

4. Experiments

4.1. Benchmark Datasets

4.2. Baselines

4.3. Experimental Setup

4.4. Experimental Results

4.4.1. Performance for Unseen Conditions

4.4.2. Performance for Cross Datasets

5. Ablation Studies

5.1. Different Augmentation Intensities

5.2. The Impact of Other Factors on Performance

6. Conclusions

6.1. Discussion

6.2. Limitation

6.3. Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics