Multisource Port Inspection Sensor Fusion with Causal Representation Learning for Cross-Border Anomaly Monitoring

Yin, Jiaxin; Lu, Zhengjia; Xiong, Baodi; Sun, Kai; Liu, Ruijia; Liu, Yachi; Li, Manzhou

doi:10.3390/s26134142


by Jiaxin Yin ¹, Zhengjia Lu ²,³, Baodi Xiong ², Kai Sun ¹,², Ruijia Liu ², Yachi Liu ¹,² and Manzhou Li ¹,*

¹ China Agricultural University, Beijing 100083, China

² National School of Development, Peking University, Beijing 100871, China

³ Faculty of Women’s Development, China Women’s University, Beijing 100101, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(13), 4142; https://doi.org/10.3390/s26134142

Submission received: 11 May 2026 / Revised: 23 June 2026 / Accepted: 29 June 2026 / Published: 1 July 2026

(This article belongs to the Special Issue Deep Learning for Perception and Recognition Based on Sensor Data: Methods and Applications, 2nd Edition)


Abstract

With the rapid development of cross-border collaboration, intelligent port construction, and international logistics networks, large volumes of multisource heterogeneous data are continuously generated during cross-border circulation. To address the limitations of traditional financial review and compliance auditing methods in characterizing multisource signal coupling, as well as the tendency of conventional deep models to rely on spuriously correlated features with insufficient interpretability, a multisource sensing signal fusion and causally explainable risk identification framework is proposed for cross-border trade anomaly detection. In this framework, electronic trade texts, structured financial declaration fields, GPS/AIS trajectories, port weighing records, RFID data, electronic seal status, X-ray inspection images, cold-chain temperature and humidity records, and vibration data are uniformly modeled as multisource sensing signals in cross-border trade and circulation processes. Subsequently, collaborative representation among textual semantics, attribute fields, logistics status, device records, and entity relationships is achieved through a cross-modal alignment mechanism. On this basis, an engineering-constraint-guided causal risk representation module is designed to reduce the interference of spuriously correlated factors, such as regions, ports, transportation modes, and textual styles, in model decisions. Meanwhile, a counterfactual anomaly response module is introduced to analyze the influence of key variable changes on risk outputs, thereby enhancing the model’s ability to identify and explain true anomaly-driving factors. Experimental results show that the proposed method achieves the best overall performance in the cross-border trade anomaly detection task, with Accuracy, Precision, Recall, F1-score, AUC, and PR-AUC reaching 0.927, 0.842, 0.811, 0.826, 0.958, and 0.817, respectively, clearly outperforming baseline models including Logistic Regression, Random Forest, XGBoost, BERT, BERT+MLP, and Multimodal Transformer. In cross-time, cross-region, cross-port, and cross-entity testing scenarios, high F1-score and AUC values are still maintained. Under complex conditions such as text noise, missing modalities, logistics trajectory perturbations, and missing sensing records, only limited performance degradation is observed. Ablation experiments further verify the effective contributions of cross-modal attention, contrastive alignment, causal financial debiasing, counterfactual response, and engineering constraints to performance improvement.

Keywords: multisource sensor data fusion; cross-border anomaly monitoring; RFID and electronic seal monitoring; cold-chain sensor networks; explainable anomaly detection

1. Introduction

Global economic integration and cross-border e-commerce generate massive volumes of cross-border financial data. This growth drives the need for intelligent financial anomaly monitoring and automated supervision [1]. Supply chain systems continuously generate diverse data throughout the cargo and trade-finance lifecycle. These data span from order placement and export declarations to port inspections, settlement-related document verification, and final delivery. Key data types include invoices, bills of lading, logistics trajectories, and enterprise entity records [2]. Port authorities increasingly adopt electronic systems and intelligent inspection tools. Consequently, supervision transitions from manual reviews to data-driven financial risk monitoring [3].

Fraudulent behaviors, however, grow increasingly complex and concealed. Smugglers use tactics such as misinvoicing, tariff classification evasion, and ambiguous commodity descriptions. They also exploit inconsistencies between financial document trails and logistics routes or construct false trade chains [4]. These illicit activities cause tax revenue losses, create financial compliance risks, and disrupt market order. Furthermore, they are linked to severe threats such as money laundering, trade-based financial crime, and supply chain security risks [5,6]. Modern cross-border anomalies typically manifest as inconsistencies across multiple data sources rather than single-source errors. Therefore, extracting efficient and explainable financial anomaly signals from heterogeneous data represents a critical challenge for intelligent port supervision.

Traditional risk identification methods rely on manual rules and expert experience. Auditors screen declarations using price thresholds, sensitive country lists, and historical blacklists. These rule-based methods offer high interpretability. However, dynamic changes define the modern cross-border environment. Evaders frequently update their strategies, rendering fixed rules obsolete and causing high false alarm rates [7]. To address this, researchers apply machine learning models like Random Forest and XGBoost to risk prediction. These models automate anomaly detection by analyzing structured fields such as price, weight, and historical statistics [8]. Unfortunately, they struggle to process unstructured texts like commodity descriptions or bill-of-lading remarks. They also fail to model complex relational networks among enterprise entities.

Deep learning offers a promising pathway for cross-border anomaly monitoring [9]. Language models like BERT automatically extract deep semantic representations from document texts, improving the detection of ambiguous declarations [10]. Multimodal learning further fuses text, structured fields, and relational data to uncover complex anomaly patterns [11,12]. Despite these advances, a significant research gap remains in achieving causal robustness and intrinsic interpretability [13]. Current models primarily learn statistical correlations. They mistakenly adopt pseudo-correlated background features, such as specific regions or text styles, as risk indicators. This reliance degrades their generalization across different ports and time periods [14,15]. Furthermore, while post hoc explanation tools like SHAP approximate feature importance after training, they lack formal causal mechanisms. Regulators need models that provide intrinsic, causal-intervention-oriented explainability. Models must natively capture the true drivers of anomalies to support targeted inspections and risk tracing [16]. Finally, practical industrial challenges, including class imbalance, noisy text, and missing modalities, further complicate model deployment [17].

To bridge this gap, we propose a causally explainable multisource fusion framework for cross-border logistics anomaly monitoring. We uniformly model electronic documents, structured fields, and entity relationships as multisource sensing signals. Our approach aligns these diverse signals into a shared feature space. We then introduce an engineering-constraint-guided causal representation mechanism. This mechanism forces the model to ignore spurious correlations and focus on stable risk drivers, such as price deviations or conflicts between commodity descriptions and HS codes. Finally, we design a counterfactual anomaly response strategy. This strategy simulates how modifications to key variables affect the risk output. By doing so, the framework provides intrinsic explainability and directly highlights the root causes of the detected anomalies.

We summarize our primary contributions as follows.

1. We construct a comprehensive multisource framework that integrates document semantics, structured fields, and entity relationships for cross-border anomaly monitoring.
2. We propose a cross-modal alignment mechanism that dynamically fuses diverse signals, establishing explicit interactions among texts, logistical attributes, and enterprise chains.
3. We design an engineering-constraint-guided causal learning module that eliminates spurious environmental correlations and isolates true anomaly-driving representations.
4. We implement a counterfactual response strategy that intervenes on key variables, providing regulators with transparent and traceable evidence for each anomaly decision.

The remainder of this paper is organized as follows. Section 2 reviews related work in cross-border risk identification, multimodal deep learning, and causal inference. Section 3 details the data collection process and the methodology of the proposed causally explainable framework. Section 4 presents the experimental setup, performance comparisons, ablation studies, and robustness evaluations. Finally, Section 5 concludes the paper and discusses future research directions.

2. Related Work

2.1. Cross-Border Data Anomaly Monitoring and Risk Identification

Early studies on cross-border financial risk identification mainly relied on rule-driven mechanisms, in which expert knowledge was converted into fixed risk rules and anomalous samples were screened through rule matching [1]. For example, when the declared price is substantially lower than the historical average, when the commodity name is inconsistent with the HS code, or when a trading entity appears on a historical blacklist, the corresponding sample is labeled as high risk [5]. From the perspective of engineering-oriented supervision, such methods can be regarded as prior-knowledge-based anomaly-state discrimination systems, with advantages such as transparent rules, strong interpretability, and easy integration into existing port supervision workflows [18]. However, cross-border logistics chains are long, the involved entities are numerous, and declaration strategies and evasion methods change rapidly. Therefore, fixed rules struggle to maintain sufficient coverage of dynamically evolving anomaly patterns [19]. In addition, rule-based systems usually require substantial manual maintenance. When complex scenarios involve joint anomalies in document information, logistics states, and entity relationships, insufficient coverage and high missed-detection rates may easily occur. Subsequently, statistical learning and traditional machine learning methods were introduced into cross-border risk analysis, with the core idea of learning statistical relationships between declaration variables and financial risk labels from historical samples [20]. For example, Logistic Regression models risk probability through a linear function, whereas Random Forest and XGBoost characterize nonlinear feature relationships through tree structures and ensemble learning. These methods can use structured fields, such as price, quantity, weight, country, and HS code, for automated modeling, thereby providing stronger data adaptability than manual rules.
However, traditional machine learning methods still mainly rely on structured fields and manually designed features, making it difficult to fully exploit unstructured anomalous clues contained in invoice descriptions, contract remarks, bill-of-lading notes, and customs declaration texts [21]. Meanwhile, these methods usually treat each declaration sample as an independent instance, which makes it difficult to characterize collaborative relationships, historical cooperation links, and group-level anomaly patterns among enterprises. Therefore, their applicability remains limited in real cross-border engineering supervision scenarios. To overcome the limitation of treating declarations as isolated events, graph-based customs fraud detection methods have recently emerged. By modeling enterprises, ports, and logistics agents as nodes, and historical trade interactions as edges, graph neural networks can capture topological anomalies and collusive fraud behaviors that traditional tabular models often miss. However, these graph-based methods usually focus on homogeneous relationship structures and struggle to align complex unstructured document semantics with node attributes.

2.2. Multisource Sensing Signal Fusion and Multimodal Deep Learning

In recent years, deep learning has achieved remarkable progress in natural language processing, anomaly detection, and multisource data fusion, providing a new technical pathway for cross-border trade finance anomaly monitoring [22]. Deep learning methods can automatically learn high-dimensional feature representations through multilayer nonlinear mappings, thereby reducing the dependence on manual feature engineering [9]. For instance, TextCNN extracts local semantic patterns from texts through convolutional operations, BiLSTM captures contextual dependencies through bidirectional sequential modeling, and BERT obtains deep semantic representations through the Transformer architecture and large-scale pretraining [23]. These methods can directly extract implicit anomaly features from document texts, such as commodity descriptions, bill-of-lading notes, and contract summaries, thereby enhancing the ability of models to understand ambiguous declarations, abnormal expressions, and potential risk semantics [24]. With the development of multimodal learning, texts, structured fields, and relational data have begun to be jointly modeled to improve anomaly identification in complex scenarios [25]. In cross-border supervision, electronic documents, declaration fields, logistics states, and enterprise relationships can be regarded as sensing signals from different sources. The key to fusion modeling lies in mapping heterogeneous data into a unified feature space and achieving information fusion through attention mechanisms or cross-modal interaction structures [26]. Similar ideas have been applied to financial risk control, bill auditing, and supply chain analysis, enabling models to jointly use textual semantics and structured variables for risk prediction. Although multimodal deep learning has shown strong potential, several challenges remain in cross-border engineering monitoring scenarios [27]. 
First, texts often involve mixed Chinese and English expressions, dense professional terminology, frequent abbreviations, and nonstandard writing, resulting in noisy textual signals. Second, obvious asynchrony and semantic alignment difficulties exist among different modalities. Third, multimodal fusion models usually contain a large number of parameters and depend heavily on data quality and sample scale, whereas real trade anomaly samples are often limited and highly imbalanced [21]. In addition, most existing multimodal models are correlation-driven methods that mainly learn statistical associations between input features and risk labels. Due to insufficient characterization of anomaly formation mechanisms and engineering constraint relationships, these models are easily affected by pseudo-correlated features [28].

2.3. Causal Inference and Explainable Anomaly Identification

To alleviate the generalization limitations caused by correlation-based learning, causal inference and causal representation learning have attracted increasing attention in recent years. Unlike traditional correlation modeling, causal inference focuses more on whether changes in variables truly lead to changes in outcomes [18]. Its theoretical foundation is usually built on structural causal models and intervention analysis frameworks, through which stable financial risk-driving factors are identified by analyzing causal pathways among variables [29]. In risk prediction and anomaly detection tasks, causal learning helps distinguish true risk factors from pseudo-correlated background features, thereby improving model robustness under distribution shifts [30]. To address performance degradation caused by environment-specific biases, domain-adversarial neural networks have been widely applied to extract features that are indistinguishable across different domains, thereby promoting generalization. Building upon this, invariant risk minimization introduces a regularization mechanism that enforces the optimal classifier to be the same across all training environments, explicitly driving the model to abandon spurious correlations. Furthermore, recent causal domain-generalization methods integrate structural causal models with representation learning to isolate causal mechanisms from environmental confounders, enabling robust predictions even under severe out-of-distribution scenarios [31]. For example, causal representation learning can constrain models to learn environment-invariant features, so that anomaly discrimination remains stable across different environments. Counterfactual inference constructs contrastive samples by asking how model outputs would change if a key variable were modified, thereby analyzing the true influence of different features on prediction results [32]. 
However, existing causal learning studies are mostly concentrated on structured data or single-modal tasks, while insufficient attention has been paid to multisource heterogeneous sensing signals in cross-border finance-related scenarios, such as document semantics, declaration fields, logistics states, and entity relationships [33]. Meanwhile, most methods focus more on causal inference itself, and causal constraints are rarely integrated with multimodal deep feature learning within a unified engineering monitoring framework. As a result, it remains difficult for models to simultaneously achieve semantic representation capability, cross-modal fusion capability, and causal robustness [34]. In addition, existing methods still have limitations in interpretability. Although many models can output risk probabilities, they fail to clearly explain which document fields, logistics states, or entity relationships jointly trigger anomalies. This remains inconsistent with the requirements of intelligent port supervision for risk tracing, targeted inspection, and decision support [35].

3. Materials and Methods

A multisource dataset was constructed for cross-border logistics and financial anomaly monitoring, with the data collection period spanning from January 2022 to December 2024. The data were mainly obtained from Tianjin Port and its surrounding freight forwarding enterprises, the cross-border service platform of Tianjin Dongjiang Comprehensive Bonded Zone, selected logistics partners associated with Ningbo Zhoushan Port, agricultural product export and cold-chain transportation enterprises in Yantai, Shandong Province, as well as public statistics and publicly available vessel trajectory data, as summarized in Table 1. Tianjin Port was selected as the primary research context because it is a major coastal port in northern China and serves as an important maritime gateway for the Beijing–Tianjin–Hebei region. Its strategic location in the Bohai Sea region, intensive cross-border logistics activities, and diverse multimodal transportation scenarios make it suitable for studying logistics anomaly monitoring. The geographical location of Tianjin Port is shown in Figure 1.

Electronic document data were mainly collected from historical desensitized document records provided by freight forwarding enterprises around Tianjin Port, customs declaration agencies, and cooperating enterprises in Tianjin Dongjiang Comprehensive Bonded Zone. A portion of agricultural product export documents was obtained from agricultural export enterprises and cold-chain logistics enterprises in Yantai, Shandong Province. The data include commercial invoices, packing lists, bills of lading, customs declarations, contract summaries, and commodity description texts. These documents were submitted by import and export enterprises during the declaration process and record information such as commodity names, specifications, usage descriptions, packaging methods, transportation descriptions, and consignee or consignor information. For electronic-form data, fields were directly parsed according to predefined templates. For PDF documents or scanned documents, text extraction was performed using high-resolution scanning devices and OCR tools.

Structured declaration data were mainly obtained from desensitized declaration records in the systems of cooperating customs declaration enterprises, cross-border e-commerce service enterprises, and comprehensive bonded zones. Public data from China Customs statistics and UN Comtrade were further incorporated to construct reference intervals for similar commodity prices and background information. The structured fields include declared price, quantity, weight, currency, country of origin, destination country, HS code, transportation mode, transaction mode, declaration time, and historical declaration statistics of enterprises. Such data are usually entered into systems after online declaration by enterprises and are characterized by clearly defined fields, strong constraints, and complete timestamps. During data collection, currencies, weight units, and quantity units were uniformly converted, while missing values, outliers, and duplicate declaration records were marked. These structured fields also provide direct evidence for trade finance risk assessment.
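The unit conversion and record-marking step described above can be illustrated with a short sketch. It is not the authors' pipeline: the field names, the conversion tables, and the price-per-kilogram outlier rule are all assumptions introduced for the example. The one point it tries to reflect from the text is that problematic records are marked rather than deleted.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical conversion tables; real rates and units would come from reference data.
RATE_TO_USD = {"USD": 1.0, "CNY": 0.14, "EUR": 1.08}
WEIGHT_TO_KG = {"kg": 1.0, "g": 0.001, "t": 1000.0}


@dataclass
class Declaration:
    record_id: str
    price: Optional[float]
    currency: str
    weight: Optional[float]
    weight_unit: str
    flags: set = field(default_factory=set)


def normalize_and_mark(records, max_unit_price_usd=10_000.0):
    """Convert units in place and attach flags instead of deleting records."""
    seen = set()
    for rec in records:
        if rec.price is None or rec.weight is None:
            # Incomplete records are only marked; no conversion is attempted.
            rec.flags.add("missing")
        else:
            rec.price = rec.price * RATE_TO_USD[rec.currency]
            rec.weight = rec.weight * WEIGHT_TO_KG[rec.weight_unit]
            rec.currency, rec.weight_unit = "USD", "kg"
            # Assumed outlier rule: non-positive weight or implausible price per kg.
            if rec.weight <= 0 or rec.price / rec.weight > max_unit_price_usd:
                rec.flags.add("outlier")
        # A repeated declaration identifier is marked as a duplicate.
        if rec.record_id in seen:
            rec.flags.add("duplicate")
        seen.add(rec.record_id)
    return records
```

Keeping the flags alongside the record lets later stages decide whether to impute, down-weight, or exclude a sample.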

Logistics trajectory data were mainly collected from transportation node records provided by logistics partners associated with Tianjin Port and Ningbo Zhoushan Port, GPS vehicle-mounted positioning terminals used by cold-chain transportation enterprises, the Automatic Identification System (AIS) for ships, and publicly available AIS vessel trajectory data. For land transportation samples, longitude, latitude, speed, direction, positioning time, stopping status, and transportation node information recorded by vehicle GPS terminals were primarily collected. For maritime transportation samples, ship position, sailing speed, heading, port calls, and route variation information recorded by AIS were collected. To enhance experimental reproducibility, public AIS trajectory data and selected public GPS trajectory data were also used for auxiliary validation of the trajectory anomaly detection and route deviation analysis modules. Specifically, out of the 42,680 GPS land trajectories, 34,144 samples, accounting for 80%, are proprietary operational data collected from logistics partners, while 8536 samples, accounting for 20%, are public trajectories. For the 31,900 AIS maritime trajectories, 19,140 samples, accounting for 60%, represent proprietary data from associated shipping enterprises, and the remaining 12,760 samples, accounting for 40%, are derived from public maritime databases.

Port inspection data were mainly obtained from on-site supervision records of port logistics parks, checkpoint systems in comprehensive bonded zones, and cooperating logistics enterprises. Specifically, these data include weighbridge records, channel camera records, license plate recognition records, RFID identification records, electronic seal status records, and partial X-ray inspection images collected from logistics parks around Tianjin Port, Tianjin Dongjiang Comprehensive Bonded Zone, and cooperative scenarios associated with Ningbo Zhoushan Port. Weighing data were automatically recorded by port weighbridges or dynamic weighing devices and compared with declared weights. X-ray images were collected by container or parcel inspection equipment to assist in determining whether cargo shape, loading density, and declared content were consistent. Camera data mainly recorded vehicle entry and exit, container appearance, and inspection procedures. RFID and electronic seal data were used to record the identification status and opening or closing status of containers or cargo batches at key nodes.

Environmental status data were mainly obtained from transportation monitoring devices deployed on agricultural product export cold-chain routes in Yantai, Shandong Province, as well as from cold-chain warehousing enterprises, bonded warehouses, and selected cross-border e-commerce logistics enterprises. The data mainly include temperature, humidity, vibration, door-opening status, and recording time. For agricultural products, food, pharmaceuticals, and other goods sensitive to transportation environments, such data can reflect environmental stability and cargo preservation status during transportation. During collection, temperature and humidity sensors automatically recorded environmental changes at fixed time intervals, electronic seals or magnetic door sensors recorded container opening and closing status, and vibration sensors recorded abnormal shocks during transportation.

To ensure the scientific validity and reproducibility of the study, a rigorous anomaly labeling protocol was established to explicitly define the origin of the anomaly risk labels. These labels were systematically derived from two primary sources, directly addressing the spectrum from confirmed fraud cases to high-risk inspection findings. First, strong positive labels were extracted directly from confirmed customs penalty records and official violation logs provided by cooperating port authorities and customs declaration agencies. These records represent explicitly verified illicit activities, such as confirmed smuggling, severe under-invoicing, and tariff classification evasion, which resulted in formal administrative or legal penalties. Second, weak labels were generated from high-confidence inspection suspicions logged during frontline port operations, combined with expert assessments. These operational indicators include unresolved discrepancies in weighbridge records, abnormal density patterns flagged by X-ray operators, and broken electronic seal alerts.
To convert these operational suspicions into reliable ground-truth labels, a multi-stage annotation process was conducted following a double-blind protocol. Three senior customs auditors and domain experts independently reviewed the historical declarations, associated documents, and sensor records to assign the final risk labels. In cases of disagreement, a consensus was reached through joint discussion, ensuring that general risk indicators were rigorously verified rather than automatically labeled as anomalies. Among the total 86,320 declaration samples, 7520 were ultimately identified as positive anomaly samples through this rigorous process, while the remaining 78,800 were classified as negative normal samples, resulting in a class imbalance ratio of approximately 1 to 10.5. To quantify the inter-annotator agreement during the evaluation of suspected cases, Cohen’s kappa coefficient was calculated, yielding an average value of 0.86, which indicates a high level of consistency and reliability in the generated labels.
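As a rough illustration of how an averaged agreement score of this kind can be computed, the sketch below evaluates Cohen's kappa for every pair of annotators and averages the results. The text does not state how the average was formed, so the pairwise averaging is an assumption, and the three label lists are made-up toy data rather than values from the dataset. The last line recomputes the reported imbalance ratio from the stated sample counts.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all annotator pairs (assumed aggregation)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical labels from three reviewers (1 = anomaly, 0 = normal).
r1 = [1, 0, 0, 1, 0, 0, 1, 0]
r2 = [1, 0, 0, 1, 0, 1, 1, 0]
r3 = [1, 0, 0, 0, 0, 0, 1, 0]
print(round(mean_pairwise_kappa([r1, r2, r3]), 3))

# Imbalance ratio from the reported counts: 78,800 normal vs. 7520 anomalous.
print(round(78_800 / 7_520, 1))  # 10.5
```

For more than two annotators, Fleiss' kappa is a common alternative to pairwise averaging; which of the two was used here is not specified in the text.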

To ensure the legitimacy, compliance, and ethical integrity of the data pipeline, stringent data governance protocols were enforced throughout the collection process. Formal data-sharing agreements and non-disclosure agreements were established with all collaborating entities, including port authorities, customs agencies, and logistics enterprises, granting explicit legal permissions for academic research use. Ethical approval was obtained from the institutional review board of the affiliated research institution prior to data acquisition. To strictly protect commercial confidentiality and individual privacy, a rigorous anonymization procedure was implemented before any data was transferred to the research servers. Specifically, sensitive identifiable information, such as enterprise names, contact details, personal identification numbers, and exact financial account information, was irreversibly masked using one-way cryptographic hashing algorithms. Furthermore, continuous variables such as exact cargo values were subjected to distribution-preserving noise injection to prevent reverse engineering, ensuring that the dataset retains its structural properties for machine learning while fully complying with relevant data privacy regulations. This governance process also supports secure research use of financial compliance-related records.
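A minimal sketch of the two anonymization steps named above is given here for orientation. The specific choices are assumptions and are not stated in the text: a keyed SHA-256 digest (HMAC) stands in for the one-way hashing, and small zero-mean multiplicative Gaussian noise stands in for the distribution-preserving noise injection. The record fields and the key are hypothetical.

```python
import hashlib
import hmac
import random


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a keyed one-way digest so the raw identifier is never stored."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def perturb(amount: float, rel_scale: float, rng: random.Random) -> float:
    """Apply zero-mean multiplicative Gaussian noise to a monetary value."""
    return amount * (1.0 + rng.gauss(0.0, rel_scale))


key = b"hypothetical-secret-kept-off-the-research-servers"
rng = random.Random(0)

record = {"enterprise": "Example Trading Co.", "cargo_value": 12_500.0}
masked = {
    # The same enterprise always maps to the same token, so entity links survive.
    "enterprise": pseudonymize(record["enterprise"], key),
    # The value stays close to the original but is no longer exact.
    "cargo_value": perturb(record["cargo_value"], rel_scale=0.01, rng=rng),
}
```

A keyed digest is used in the sketch because an unkeyed hash of a short identifier can be reversed by enumerating candidate inputs. Deterministic tokens also keep enterprise relationship structure usable downstream.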

3.1. Data Preprocessing and Augmentation Strategy

In the task of cross-border document fraud detection, raw data usually exhibit pronounced heterogeneity, noise, and distributional imbalance. Therefore, before model training, multimodal data need to be normalized and distributionally optimized through systematic data preprocessing and data augmentation strategies. In essence, data preprocessing aims to map the original observations $X$ into a representation space $\tilde{X}$ that is more suitable for model learning through a series of deterministic or statistical transformations, namely by constructing a mapping function $f_{pre} : X \to \tilde{X}$, thereby improving feature separability and stability. Data augmentation expands the sample distribution by introducing a perturbation function $f_{aug}$ while preserving label semantics, so as to improve model robustness to input perturbations. This process can essentially be represented as the construction of an extended distribution $p'(x)$ based on the original data distribution $p(x)$.

For textual modality preprocessing, the core principle is to make textual representations more stable and comparable through normalization and semantic alignment of unstructured texts. Let the original text sequence be $x = \{w_{1}, w_{2}, \dots, w_{n}\}$. A normalized sequence $\tilde{x}$ is obtained through the preprocessing function $f_{text}$, namely $\tilde{x} = f_{text}(x)$. In this process, abnormal characters and invalid symbols are first removed by a rule-based function $f_{clean}$, which can be represented as a filtering mapping over the character space $V$. Subsequently, synonymous expressions are mapped into a unified vocabulary space through a terminology normalization function $f_{norm}$, for example, by mapping expressions such as “pcs” and “pieces” to a unified identifier, thereby reducing lexical sparsity. Furthermore, different measurement expressions are converted into a unified dimension through a unit normalization function $f_{unit}$, so that the numerical semantics embedded in the text become consistent. Finally, the text sequence is mapped into a vector representation $h_{text}$ through the embedding function $\phi(\cdot)$:

$$h_{text} = \phi(\tilde{x}) = \phi(f_{unit}(f_{norm}(f_{clean}(x)))). \tag{1}$$
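The composition inside Equation (1) can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: the synonym table, the unit table, and the allowed character set are assumptions, only plain decimal numbers followed by a weight unit are handled, and the embedding function $\phi$ is left out because it would be supplied by a pretrained language model.

```python
import re

# Hypothetical synonym table; a real system would use a curated trade glossary.
TERM_MAP = {"pcs": "piece", "pieces": "piece", "pc": "piece",
            "ctn": "carton", "ctns": "carton"}
# Hypothetical unit table mapping weight units to kilograms.
UNIT_TO_KG = {"kg": 1.0, "g": 0.001, "t": 1000.0, "lb": 0.45359237}


def f_clean(text):
    """Replace symbols outside a small allowed set, then collapse whitespace."""
    text = re.sub(r"[^\w\s.,%/-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def f_norm(text):
    """Lowercase the text and map synonymous terms to one identifier."""
    return re.sub(r"[a-z]+",
                  lambda m: TERM_MAP.get(m.group(0), m.group(0)),
                  text.lower())


def f_unit(text):
    """Rewrite '<number> <unit>' weight expressions in kilograms."""
    def to_kg(m):
        kg = float(m.group(1)) * UNIT_TO_KG[m.group(2)]
        return f"{kg:g} kg"
    # Longer unit names first so 'kg' is tried before 'g'.
    units = "|".join(sorted(UNIT_TO_KG, key=len, reverse=True))
    return re.sub(rf"(\d+(?:\.\d+)?)\s*({units})\b", to_kg, text)


def f_text(text):
    """Compose the three steps in the order used by Equation (1)."""
    return f_unit(f_norm(f_clean(text)))


print(f_text("Frozen apples ★ 200 PCS, 1500 g each"))
# frozen apples 200 piece, 1.5 kg each
```

Because `\w` matches Unicode word characters in Python 3, Chinese characters pass through `f_clean` unchanged, which matters for the mixed Chinese and English text the section describes.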

This process essentially reduces noise interference and expression variation in the textual modality, enabling the model to focus more on semantic-level risk signals. In cross-border scenarios, where mixed Chinese–English expressions and frequent abbreviations are common, this preprocessing procedure is particularly important for reducing semantic ambiguity and improving representational consistency.

For structured data preprocessing, the core objective is to transform variables of different scales and types into a learnable space through numerical standardization and categorical encoding. For a continuous variable $x_{i}$, normalization or standardization is applied. For example, the standardization operation can be expressed as follows:

$$\tilde{x}_{i} = \frac{x_{i} - \mu_{i}}{\sigma_{i}}, \tag{2}$$

where

μ_{i}

and

σ_{i}

denote the mean and standard deviation of variable

x_{i}

in the training set, respectively. This operation can eliminate the adverse influence of scale differences among variables on model training. For a categorical variable

c_{j}

, an embedding function

e (\cdot)

is used to map it into a low-dimensional vector representation

h_{j} = e (c_{j})

, so that discrete attributes can participate in feature learning within a continuous space. Meanwhile, for missing values, a conditional expectation imputation strategy is introduced, in which missing variables are estimated using other observed feature information. This can be expressed as:

x_{miss} = E [x ∣ X_{obs}] .

(3)
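As a rough illustration of Equations (2) and (3), the sketch below standardizes a field with statistics fitted on training data only and imputes a missing value with a group-wise mean. Using a per-HS-code mean as the estimator of the conditional expectation is an assumption made here for brevity; the field names and values are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Return (mu, sigma) estimated on training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against zero variance
    return mu, sigma

def standardize(value, mu, sigma):
    """Equation (2): subtract the training mean, divide by the training std."""
    return (value - mu) / sigma

def fit_conditional_mean(rows, target, given):
    """Estimate E[target | given] as a per-group mean over observed rows."""
    groups = {}
    for row in rows:
        if row[target] is not None:
            groups.setdefault(row[given], []).append(row[target])
    return {key: mean(vals) for key, vals in groups.items()}

def impute(row, target, given, table, fallback):
    """Equation (3): fill a missing value from the conditional estimate."""
    if row[target] is None:
        return table.get(row[given], fallback)
    return row[target]

# Hypothetical training rows.
train = [
    {"hs": "7318", "unit_price": 2.0},
    {"hs": "7318", "unit_price": 4.0},
    {"hs": "8471", "unit_price": 500.0},
    {"hs": "8471", "unit_price": None},
]
table = fit_conditional_mean(train, "unit_price", "hs")
observed = [r["unit_price"] for r in train if r["unit_price"] is not None]
mu, sigma = fit_standardizer(observed)
filled = impute({"hs": "7318", "unit_price": None}, "unit_price", "hs", table, mu)
```

Fitting both the standardizer and the imputation table on training rows alone keeps validation and test statistics out of the preprocessing step.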

In addition, because different currencies and measurement units are involved in cross-border data, a conversion function

f_{conv}

is required to unify prices and quantities into standard scales. For example:

x_{price}^{USD} = r_{currency} \cdot x_{price},

(4)

where

r_{currency}

denotes the exchange-rate conversion factor. Through the above processing steps, scale inconsistency and abnormal interference in structured data can be effectively reduced.

For relational modality construction, the core idea is to characterize potential dependencies among samples by introducing graph structures or relational vectors, thereby supplementing group-level behavioral information that is missing from individual samples. Let the enterprise set be denoted as

V

, and let the relationships among enterprises form a graph

G = (V, E)

, where the edge weight

A_{i j}

represents the association strength between enterprise i and enterprise j, such as co-occurrence frequency or transaction frequency. Based on this graph structure, node representations can be extracted through graph representation learning:

h_{i}^{(l + 1)} = σ (\sum_{j \in N (i)} A_{i j} W^{(l)} h_{j}^{(l)}),

(5)

where

h_{i}^{(l)}

denotes the node representation at the l-th layer,

W^{(l)}

is a learnable parameter, and

σ

is a nonlinear activation function. This propagation mechanism aggregates information from neighboring nodes into the current node, thereby capturing collaborative risk patterns among enterprises, such as anomalous chains or group-based behaviors. By introducing the relational modality, the model can be extended from individual risk modeling to group risk modeling, significantly improving its ability to identify complex fraud behaviors.

For data augmentation, the theoretical basis lies in constructing perturbed samples in the input space so that the model can learn a more stable decision boundary. For the textual modality, a semantics-preserving perturbation function

f_{aug}^{text}

can be defined, such as synonym replacement or expression-variant generation, so that the augmented sample

\tilde{x}

satisfies the semantic consistency constraint:

y (\tilde{x}) = y (x), \tilde{x} = f_{aug}^{text} (x),

(6)

where

y (\cdot)

denotes the labeling function. This strategy can improve model robustness to expression variations. In the structured modality, augmentation can be achieved by introducing small random perturbations to non-critical variables, such as:

{\tilde{x}}_{i} = x_{i} + ϵ, ϵ \sim N (0, σ^{2}) .

(7)

This method expands the data distribution without changing the sample semantics. In addition, a resampling strategy can be adopted to alleviate class imbalance, making the distribution of positive and negative samples more balanced.

Within the causal learning framework, data augmentation is further extended to counterfactual sample generation. The core idea is to construct contrastive samples that differ from the original sample in certain variables while keeping other conditions unchanged. Let the original sample be x and the key variable be

x_{k}

. Then, the counterfactual sample

x^{c f}

can be expressed as:

x^{c f} = (x_{1}, \dots, x_{k}^{'}, \dots, x_{n}),

(8)

where

x_{k}^{'}

denotes the intervened value of the key variable. By comparing the model outputs on x and

x^{c f}

, the causal effect of variable

x_{k}

on the prediction result can be characterized as follows:

Δ y = f (x) - f (x^{c f}) .

(9)

This mechanism not only enhances model sensitivity to changes in key variables, but also provides a basis for subsequent explainability analysis. By incorporating counterfactual samples into the training process, the model can be guided to learn more stable causal features, thereby reducing its dependence on pseudo-correlated signals.
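Equations (8) and (9) can be made concrete with a toy example. The logistic scoring function, feature names, and weights below are hypothetical stand-ins for the trained model; only the construction of the counterfactual sample and the output difference follow the text.

```python
import math

def risk_score(sample, weights, bias):
    """Toy stand-in for f(.): logistic score over numeric features."""
    z = bias + sum(weights[name] * sample[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))

def counterfactual(sample, key, new_value):
    """Equation (8): copy x and replace only the k-th variable."""
    cf = dict(sample)
    cf[key] = new_value
    return cf

def effect(sample, key, new_value, weights, bias):
    """Equation (9): delta y = f(x) - f(x^cf)."""
    original = risk_score(sample, weights, bias)
    intervened = risk_score(counterfactual(sample, key, new_value), weights, bias)
    return original - intervened

weights = {"price_deviation": 2.0, "weight_gap": 1.5}  # illustrative values
bias = -1.0
x = {"price_deviation": 1.2, "weight_gap": 0.1}
delta = effect(x, "price_deviation", 0.0, weights, bias)
```

A positive `delta` means that resetting the price deviation to zero lowers the predicted risk, which is the kind of signal later used for attribution.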

3.2. Proposed Method

3.2.1. Overall Architecture

We propose a multisource fusion and causally explainable framework for cross-border logistics anomaly monitoring. Unlike conventional methods relying solely on isolated data types, our model jointly perceives textual semantics, structured attributes, and entity relationships to identify risks. The end-to-end pipeline consists of four main stages. First, specialized encoders process three core inputs: a pretrained language model extracts the semantic representation

h_{i}^{t}

from document texts

x_{i}^{t}

; a tabular encoder processes continuous and categorical declaration fields

x_{i}^{s}

into a structured representation

h_{i}^{s}

; and a graph-based encoder models enterprise associations

x_{i}^{r}

to output a relational representation

h_{i}^{r}

. Second, a cross-modal attention mechanism aligns these representations and lets them interact in a unified latent space, generating a multisource fused representation

z_{i}

. Third, an engineering-constraint-guided causal debiasing module filters

z_{i}

to suppress spurious environment-related correlations (e.g., those tied to regional attributes or textual styles). This isolates true anomaly-driving factors, yielding a stable, debiased risk representation

z_{i}^{c}

. Finally, a counterfactual anomaly response module simulates interventions on key variables (e.g., correcting HS codes or standardizing prices). By comparing the risk outputs of the original and counterfactual samples, the model enhances its sensitivity to true risk drivers. The debiased representation is then fed into a classification layer to output the final anomaly risk score alongside explainable risk attributions.

3.2.2. Multisource Sensing Representation and Cross-Modal Alignment Module

In the multisource sensing representation and cross-modal alignment module, each preprocessed sample is represented as

x_{i} = {x_{i}^{t}, x_{i}^{s}, x_{i}^{r}}

, where

x_{i}^{t}

denotes the electronic document text sequence,

x_{i}^{s}

denotes structured declaration and logistics state fields, and

x_{i}^{r}

denotes entity relationship features. For the textual modality, a pretrained language model is adopted as the semantic encoder. Specifically, the BERT-base model is employed to extract deep semantic features. Texts such as invoice commodity descriptions, packing-list descriptions, bill-of-lading notes, customs declaration remarks, and contract summaries are first fed into the token embedding layer and then encoded by multilayer Transformer blocks to obtain contextual semantic representations. Let the embedding of the text sequence be

E_{i}^{t} = {e_{i, 1}^{t}, \dots, e_{i, L}^{t}}

. The text encoding process can then be expressed as

H_{i}^{t} = {Transformer}_{t} (E_{i}^{t}; θ_{t})

, and the sample-level textual vector is obtained through a special-token representation or attention pooling as

h_{i}^{t} = Pool (H_{i}^{t})

, outputting a 768-dimensional textual feature vector. This design can capture contextual dependencies among commodity names, materials, specifications, uses, packaging methods, and transportation descriptions, thereby avoiding semantic omissions caused by reliance on keyword matching alone.

For the structured modality, continuous fields are first normalized and then fed into a linear mapping layer, while categorical fields such as HS code, country of origin, transportation mode, and port node type are converted into dense vectors through embedding matrices. As shown in Figure 2, let the continuous variables be

u_{i}

and the categorical variables be

v_{i}

. The structured representation can be written as

h_{i}^{s} = {MLP}_{s} ([W_{u} u_{i}; Emb (v_{i})]; θ_{s})

, where

[\cdot; \cdot]

denotes the concatenation operation. The

{MLP}_{s}

consists of several linear layers, nonlinear activations, normalization layers, and dropout layers, and is used to model combinational relationships among price, quantity, weight, currency, HS code, transportation mode, and inspection state. The output dimension of this structured representation

h_{i}^{s}

is set to 256.

For the relational modality, if entity associations are represented as a graph, enterprise entities are treated as nodes, while historical transactions, consignee–consignor relationships, or customs-declaration agency relationships are treated as edges; a GraphSAGE (Graph Sample and Aggregate) encoder is then used to obtain the relationship representation of the current declaration entity. If relationship information is provided as statistical features, it is mapped by a relationship feature encoding network. This process can be uniformly expressed as

h_{i}^{r} = {Encoder}_{r} (x_{i}^{r}; θ_{r})

, which generates a 256-dimensional relationship feature vector. This vector is used to characterize link risk information such as enterprise historical cooperation frequency, upstream and downstream stability, risk co-occurrence degree, and abnormality of associated entities.

Because multisource sensing signals differ substantially in acquisition devices, data forms, semantic granularity, and temporal distributions, direct concatenation may cause a dominant modality to control model decisions, or may prevent explicit associations from being established among document semantics, declaration fields, and logistics states. Therefore, a cross-modal projection and alignment mechanism is further designed to map

h_{i}^{t}

,

h_{i}^{s}

, and

h_{i}^{r}

into the same latent space, yielding

z_{i}^{t} = W_{t} h_{i}^{t} + b_{t}

,

z_{i}^{s} = W_{s} h_{i}^{s} + b_{s}

, and

z_{i}^{r} = W_{r} h_{i}^{r} + b_{r}

. After dimensional unification, cross-modal attention is adopted to achieve interactions among modalities. Taking the attention of the document textual representation to structured declaration fields as an example, the attention process can be expressed as

A_{t \leftarrow s} = softmax (Q_{t} K_{s}^{⊤} / \sqrt{d}) V_{s}

, where

Q_{t} = z_{i}^{t} W_{Q}

,

K_{s} = z_{i}^{s} W_{K}

,

V_{s} = z_{i}^{s} W_{V}

, and d denotes the latent-space dimension, which is uniformly set to 256 for all modalities to ensure consistent cross-modal interactions. Similarly, bidirectional attention can be constructed between text and relationships as well as between structured fields and relationships, enabling commodity descriptions to actively align with HS codes, price intervals, weight states, and entity relationship features, while allowing relationship-chain information to reversely adjust the understanding of current declaration fields. The interaction representations from different directions are then fed into a gated fusion unit to obtain the final multimodal fused representation:

z_{i} = g_{t} ⊙ {\tilde{z}}_{i}^{t} + g_{s} ⊙ {\tilde{z}}_{i}^{s} + g_{r} ⊙ {\tilde{z}}_{i}^{r},

(10)

where

g_{m} = σ (W_{g} [{\tilde{z}}_{i}^{t}; {\tilde{z}}_{i}^{s}; {\tilde{z}}_{i}^{r}] + b_{g})

denotes the gate (modality weight) for modality m, with m ranging over t, s, and r, and ⊙ denotes element-wise multiplication. This gating mechanism can dynamically adjust the contribution of each modality according to the signal quality and anomaly source of different samples. For instance, when document descriptions are ambiguous but declared prices or weights are clearly abnormal, the weight of the structured modality is increased. When entity relationship anomalies are prominent, the influence of the relational modality on the final representation is enhanced. When obvious conflicts occur between document semantics and HS codes, the interaction weights between textual and structured fields are amplified.

To further ensure discriminative consistency of aligned representations, a cross-modal contrastive constraint is introduced, so that representations of different sensing signals from the same sample are pulled closer in the latent space, while modality representations from different samples remain distinguishable. Taking the textual and structured modalities as an example, the alignment loss can be expressed as:

L_{t s} = - log \frac{exp (sim (z_{i}^{t}, z_{i}^{s}) / τ)}{\sum_{j} exp (sim (z_{i}^{t}, z_{j}^{s}) / τ)},

(11)

where

sim (\cdot)

denotes the similarity function, and

τ

is the temperature coefficient. Similar constraints can also be constructed between the relational modality and the textual and structured modalities, resulting in the final alignment objective:

L_{a l i g n} = L_{t s} + L_{t r} + L_{s r} .

(12)

Through this design, anomaly cues from document texts, declaration fields, and entity relationships can be learned separately, while semantic consistency and complementarity among different sensing signals can also be explicitly strengthened. For cross-border logistics anomaly monitoring, this structure facilitates the identification of complex anomaly patterns, such as inconsistencies between commodity descriptions and HS codes, mismatches between declared prices and commodity semantics, inconsistencies between weight records and commodity attributes, deviations between logistics links and declared routes, and abnormal coupling between entity historical relationships and current transaction behaviors. Accordingly, the problems of insufficient information from a single data source and semantic fragmentation across modalities can be alleviated, providing a more stable and complete risk representation foundation for subsequent causal debiasing and counterfactual anomaly response.
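The contrastive alignment term in Equation (11) can be sketched in a few lines. Cosine similarity, a temperature of 0.1, and averaging over the batch are assumptions made here, since the text specifies neither the similarity function nor the temperature value; the two-dimensional vectors are toy inputs.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here as one possible choice of sim(., .)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(z_text, z_struct, tau=0.1):
    """Batch mean of Equation (11): -log softmax over j, evaluated at j = i."""
    total = 0.0
    for i, zt in enumerate(z_text):
        logits = [cosine(zt, zs) / tau for zs in z_struct]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(z_text)

z_t = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(z_t, [[0.9, 0.1], [0.1, 0.9]])
swapped = alignment_loss(z_t, [[0.1, 0.9], [0.9, 0.1]])
```

When each text vector is closest to the structured vector of its own sample, the loss is near zero; pairing it with another sample's vector makes the loss large. The full objective in Equation (12) would add the analogous text-relation and structure-relation terms.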

3.2.3. Engineering-Constraint-Guided Causal Risk Representation Debiasing Module

The engineering-constraint-guided causal risk representation debiasing module receives the fused representation

z_{i} \in R^{C}

obtained from the previous stage, where C denotes the number of fused feature channels. The objective of this module is not merely to improve classification accuracy on the training set, but to extract risk representations from multisource sensing signals that remain stable across different times, regions, ports, and entity environments.

As shown in Figure 3, the fused representation is first passed through an

L_{c}

-layer causal risk encoding network. In our specific implementation, the number of encoding layers is set to

L_{c} = 3

. The initial input dimension

C_{0}

aligns with the multimodal fused dimension of 256, and the subsequent hidden dimensions are uniformly configured as 256 to maintain information capacity. Each layer consists of a linear mapping, normalization, nonlinear activation, and residual connection. The input width of the l-th layer is

C_{l}

, the output width is

C_{l + 1}

, the height can be interpreted as the sample dimension B, and the number of channels corresponds to the hidden feature dimension

C_{l}

. This process is expressed as:

q_{i}^{l + 1} = ϕ (Norm (W_{c}^{l} q_{i}^{l} + b_{c}^{l})) + R_{l} q_{i}^{l}, q_{i}^{0} = z_{i},

(13)

where

W_{c}^{l} \in R^{C_{l + 1} \times C_{l}}

,

b_{c}^{l} \in R^{C_{l + 1}}

,

R_{l}

is the residual mapping, and

ϕ (\cdot)

is a nonlinear function. For reproducible implementation, the normalization is realized via Layer Normalization, and the activation function is the Gaussian Error Linear Unit (GELU). Dropout with a probability of 0.1 is also applied within each block to prevent overfitting. After

L_{c}

layers of encoding, the candidate risk representation

q_{i}

is obtained. Subsequently, a causal gating layer is introduced to decompose

q_{i}

into the stable risk component

q_{i}^{r i s k}

and the environmental bias component

q_{i}^{e n v}

:

m_{i} = σ (W_{m} q_{i} + b_{m}), q_{i}^{r i s k} = m_{i} ⊙ q_{i}, q_{i}^{e n v} = (1 - m_{i}) ⊙ q_{i},

(14)

where

m_{i}

is a learnable causal mask,

W_{m} \in R^{C_{q} \times C_{q}}

, and

C_{q}

denotes the number of channels after causal encoding. To rigorously justify this decomposition from the perspective of a structural causal model, we conceptualize the generation of the observed multimodal representation

q_{i}

as a function of two independent latent subspaces: the true causal factors C that universally determine the anomaly outcome Y, and the environmental confounders E that spuriously correlate with Y due to selection bias across different ports or operational regions. According to the independent causal mechanisms principle, the joint distribution can be factorized, and the causal mechanism

P (Y | C)

remains invariant under interventions on E. In this context, the mask

m_{i}

functions not merely as a deterministic feature selector, but as a differentiable approximation of the structural assignment equations

C \leftarrow f_{C} (q_{i})

and

E \leftarrow f_{E} (q_{i})

. By continuously updating

m_{i}

through the subsequent adversarial environment prediction objective, the network effectively simulates a soft intervention

d o (E)

, enforcing the separation of invariant causal pathways from shifting environmental contexts. Consequently,

m_{i}

isolates the stable structural parents of the anomaly risk, ensuring that

q_{i}^{r i s k}

captures only the genuine anomaly-driving factors, aligning directly with the confounder disentanglement paradigm in formal causal inference. This design enables the model to automatically select feature components that are more stably related to anomaly risk and suppress information that may carry environmental bias. The environmental variables may represent non-core background factors such as regions, ports, transportation modes, time windows, or enterprise categories. To further reduce the environmental identifiability of

q_{i}^{r i s k}

, an environmental discriminator composed of

L_{d}

fully connected layers is introduced. Specifically, the discriminator depth is set to

L_{d} = 2

, with the intermediate hidden layer width configured to 128, followed by a classification layer outputting the environmental predictions. The width of each layer is denoted as

D_{l}

, and the output is an environmental prediction distribution:

{\hat{e}}_{i} = Softmax (W_{d}^{L_{d}} ϕ (\dots ϕ (W_{d}^{1} q_{i}^{r i s k} + b_{d}^{1})) + b_{d}^{L_{d}}) .

(15)

The risk encoder is encouraged to make

q_{i}^{r i s k}

difficult for environmental prediction, whereas the environmental discriminator attempts to correctly identify the environment. To technically execute this adversarial training within a standard automatic differentiation framework, a Gradient Reversal Layer is inserted between the causal gating layer and the environmental discriminator. During the forward propagation, this layer acts as an identity transformation, allowing the discriminator to learn environmental features. During the backward propagation, it multiplies the passing gradients by a negative scaling factor, reversing the optimization direction for the upstream encoder. Therefore, a minimax optimization is formed:

min_{θ_{c}, θ_{y}} max_{θ_{d}} J = L_{y} (g (q_{i}^{r i s k}), y_{i}) - α L_{e} (d (q_{i}^{r i s k}), e_{i}) + β {∥q_{i}^{r i s k ⊤} q_{i}^{e n v}∥}_{F}^{2} ,

(16)

where

L_{y}

denotes the anomaly classification loss,

L_{e}

denotes the environmental discrimination loss, and the last term enforces orthogonality between the risk component and the environmental component to reduce information overlap. The rationale can be explained as follows. If

q_{i}^{r i s k}

still contains substantial environmental information, then a discriminator

d (\cdot)

exists such that

L_{e}

becomes small. In minimax training, the risk encoder and the environmental discriminator engage in an adversarial game that can be interpreted through a variational bound on the mutual information

I (q_{i}^{r i s k}; e_{i})

. Let the discriminator

D_{ϕ} (e_{i} ∣ q_{i}^{r i s k})

serve as a variational approximation of the true posterior

P (e_{i} ∣ q_{i}^{r i s k})

. Because the cross-entropy of any approximate posterior is at least the conditional entropy of the true posterior, the mutual information is bounded from below by:

I (q_{i}^{r i s k}; e_{i}) = H (e_{i}) - H (e_{i} ∣ q_{i}^{r i s k}) \geq H (e_{i}) + E_{q, e} [log D_{ϕ} (e_{i} ∣ q_{i}^{r i s k})] .

(17)

Equality holds when the discriminator matches the true posterior. During the adversarial optimization process, the discriminator maximizes the expected log-likelihood

E [log D_{ϕ} (e_{i} ∣ q_{i}^{r i s k})]

to tighten this bound, while the risk encoder minimizes this term, which is mathematically equivalent to maximizing the adversarial loss

L_{e}

, so that even the strongest available discriminator recovers little environmental information. When the minimax game reaches equilibrium with an optimal discriminator, the bound is tight, the encoder has obfuscated the environmental information, and the optimal variational posterior

D_{ϕ} (e_{i} ∣ q_{i}^{r i s k})

degenerates to the marginal prior

P (e_{i})

. Under this condition, the following holds:

lim_{equilibrium} I (q_{i}^{r i s k}; e_{i}) = H (e_{i}) + E_{e} [log P (e_{i})] = H (e_{i}) - H (e_{i}) = 0 .

(18)

Therefore,

q_{i}^{r i s k}

is formally constrained to be independent of environmental variables, and the dependence of the model on pseudo-correlated factors such as regions, transportation modes, port distributions, and textual styles is mathematically weakened. Meanwhile, the classification loss still requires

q_{i}^{r i s k}

to retain information that is discriminative for the anomaly label

y_{i}

. Consequently, a stable risk representation that is effective for the label but insensitive to the environment is learned. When applied to the proposed task, this module can prevent the model from incorrectly treating high-frequency routes, specific country attributes, fixed transportation modes, or common textual expressions as risk origins. Instead, the model is encouraged to focus on clues with engineering causal meanings, such as conflicts between commodity descriptions and HS codes, price deviations, inconsistencies in quantity and weight, abnormal logistics paths, abnormal inspection records, and abnormal coupling of entity relationships, thereby improving generalization ability and explanatory reliability across time, regions, ports, and entities.
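The causal gating step of Equation (14) and the orthogonality term of Equation (16) can be sketched for a single sample as follows. The three-dimensional vector and the mask weights are illustrative values, not trained parameters, and the adversarial discriminator is left out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def causal_gate(q, W_m, b_m):
    """Equation (14): m = sigmoid(W_m q + b_m), then split q by the mask."""
    m = [sigmoid(sum(w * x for w, x in zip(row, q)) + b) for row, b in zip(W_m, b_m)]
    q_risk = [mi * qi for mi, qi in zip(m, q)]
    q_env = [(1.0 - mi) * qi for mi, qi in zip(m, q)]
    return m, q_risk, q_env

def overlap_penalty(q_risk, q_env):
    """Squared inner product: the single-vector case of the orthogonality term."""
    return sum(a * b for a, b in zip(q_risk, q_env)) ** 2

q = [0.5, -1.0, 2.0]
W_m = [[4.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, -4.0]]  # illustrative weights
b_m = [0.0, 0.0, 0.0]
m, q_risk, q_env = causal_gate(q, W_m, b_m)
penalty = overlap_penalty(q_risk, q_env)
```

The two components always sum back to the candidate representation. In this vector form the inner product equals the sum over channels of m(1 - m)q², so the penalty shrinks as mask entries saturate toward 0 or 1.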

The counterfactual perturbation generator is parameterized as a 2-layer feedforward network. It takes the concatenation of the 256-dimensional stable risk representation and a 16-dimensional learnable embedding corresponding to the specific intervention factor as input. The hidden layer dimension is configured to 128, and the output dimension aligns with the 256-dimensional latent space to facilitate direct representation addition. The shared risk evaluator is constructed as a 3-layer fully connected classification network, featuring progressively decreasing hidden dimensions of 128 and 64, terminating in a 1-dimensional output bounded by a Sigmoid activation function to yield the final anomaly risk probability. Consistent with the causal encoding network, Layer Normalization and Gaussian Error Linear Unit activations are applied throughout to maintain optimization stability. Furthermore, to guarantee that the generated counterfactual samples reflect realistic operational changes rather than unconstrained mathematical artifacts, the perturbation magnitude regularization weight is strictly set to 0.1. This restricts the latent counterfactual shifts to a bounded neighborhood around the original sample, ensuring that the theoretical response analysis remains practically grounded and fully reproducible.

3.2.4. Counterfactual Anomaly Response Enhancement Module

The counterfactual anomaly response enhancement module receives the stable risk representation

r_{i} \in R^{C_{r}}

after causal debiasing, that is, the risk component produced by the causal gating layer in Section 3.2.3, and constructs counterfactual risk responses around key intervenable factors in cross-border supervision. This module consists of a counterfactual perturbation generator, a shared risk evaluator, and a response consistency constraint.

As shown in Figure 4, the counterfactual perturbation generator adopts an

L_{f}

-layer feedforward network. The input width of each layer is

F_{l}

, the output width is

F_{l + 1}

, the height in the sample dimension is B, and the number of channels corresponds to the latent risk variable dimension

F_{l}

. Its input is the concatenation of the stable risk representation

r_{i}

and the intervenable field embedding

a_{i}^{k}

, where

a_{i}^{k}

denotes the k-th key variable, such as commodity description standardization, HS code consistency, price deviation degree, declared weight consistency, logistics route deviation degree, or entity relationship abnormality. The counterfactual perturbation generation process can be expressed as:

δ_{i}^{k} = W_{f}^{L_{f}} ψ (W_{f}^{L_{f} - 1} \dots ψ (W_{f}^{1} [r_{i}; a_{i}^{k}] + b_{f}^{1}) + b_{f}^{L_{f} - 1}) + b_{f}^{L_{f}},

(19)

where

W_{f}^{l} \in R^{F_{l + 1} \times F_{l}}

,

b_{f}^{l} \in R^{F_{l + 1}}

,

ψ (\cdot)

denotes the nonlinear activation function, and

δ_{i}^{k}

denotes the counterfactual perturbation generated in the latent space after intervention on the k-th type of risk factor. Compared with directly modifying original document fields or sensor records, generating counterfactual samples in the latent representation space can avoid the problems of text-field discreteness, non-differentiable categorical fields, difficulty in continuously modifying logistics trajectories, and difficulty in directly optimizing relationship links. The counterfactual risk representation is defined as:

{\bar{r}}_{i}^{k} = r_{i} + T_{k} δ_{i}^{k},

(20)

where

T_{k} \in R^{C_{r} \times F_{L_{f} + 1}}

is the projection matrix corresponding to the k-th type of intervention variable and is used to map the counterfactual perturbation back to the risk representation space. Subsequently, the original representation

r_{i}

and the counterfactual representation

{\bar{r}}_{i}^{k}

share the same risk evaluator

p (\cdot)

, which is composed of an

L_{p}

-layer fully connected classification network. The width of each layer is denoted as

P_{l}

, and the final output is the anomaly risk probability:

s_{i} = σ (p (r_{i}; θ_{p})), {\bar{s}}_{i}^{k} = σ (p ({\bar{r}}_{i}^{k}; θ_{p})) .

(21)

To make counterfactual changes consistent with cross-border supervision logic, directional constraints are imposed on different types of variables. If the k-th type of variable is corrected to a more reasonable state, such as standardized commodity descriptions, consistency between HS codes and commodity semantics, declared prices returning to normal intervals, consistency between declared weights and inspection weights, restored reasonable logistics routes, or removal of abnormal entity associations, the risk score of high-risk samples should decrease. This constraint can be expressed as:

L_{d i r} = \sum_{i} \sum_{k} ω_{i}^{k} max (0, {\bar{s}}_{i}^{k} - s_{i} + γ_{k}),

(22)

where

ω_{i}^{k}

denotes the weight indicating whether the k-th type of variable is intervenable for sample i, and

γ_{k}

denotes the expected risk-change margin. For non-critical background variables, the model output should not change drastically under slight perturbations. Therefore, a stability constraint is further introduced:

L_{s t a} = \sum_{i} \sum_{k \in B} {∥p (r_{i}; θ_{p}) - p ({\bar{r}}_{i}^{k}; θ_{p})∥}_{2}^{2},

(23)

where

B

denotes the set of background variables. The final objective for counterfactual discrimination enhancement is:

L_{c f}^{n e w} = η_{1} L_{d i r} + η_{2} L_{s t a} + η_{3} \sum_{i} \sum_{k} {∥δ_{i}^{k}∥}_{2}^{2} ,

(24)

where the last term is used to constrain the magnitude of counterfactual perturbations, ensuring that the generated counterfactual samples do not deviate excessively from the original samples. Its mathematical rationale can be explained through local risk response analysis. By performing a first-order expansion of the shared risk evaluator around

r_{i}

, the following can be obtained:

{\bar{s}}_{i}^{k} - s_{i} \approx \nabla_{r} p {(r_{i}; θ_{p})}^{⊤} T_{k} δ_{i}^{k} .

(25)

Therefore, if an intervention direction

T_{k} δ_{i}^{k}

is highly aligned with the risk gradient direction, the corresponding variable has a significant influence on model discrimination. If the two directions are approximately orthogonal, the corresponding variable contributes little to the risk output. By minimizing

L_{d i r}

, the model is forced to produce a risk decrease consistent with logic when key risk variables are corrected. By minimizing

L_{s t a}

, the model remains stable under perturbations of background variables. Accordingly, for the key risk set

K

and the background set

B

, the model learning objective approximately satisfies:

|\nabla_{r} p {(r_{i}; θ_{p})}^{⊤} T_{k} δ_{i}^{k}| > |\nabla_{r} p {(r_{i}; θ_{p})}^{⊤} T_{b} δ_{i}^{b}|, k \in K, b \in B .

(26)

This indicates that the model becomes more sensitive to true risk factors near the decision boundary while remaining more robust to non-causal background factors. When applied to the proposed task, this module can extend anomaly detection from simple probability output to an intervenable risk explanation process. For example, when a declaration sample is classified as high risk, the changes in the risk score after standardizing commodity descriptions, correcting HS codes, adjusting declared prices, calibrating weight records, restoring reasonable logistics paths, or removing abnormal entity associations can be further analyzed, so that the most important anomaly sources can be identified.
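The three terms of the counterfactual objective in Equations (22) to (24) can be written out directly. The magnitude weight of 0.1 follows the value stated for the perturbation regularization; the other two weights, the scores, and the margins are placeholder assumptions.

```python
def directional_loss(s, s_cf, omega, gamma):
    """Equation (22): hinge on s_cf - s + gamma over samples i and factors k."""
    return sum(
        omega[i][k] * max(0.0, s_cf[i][k] - s[i] + gamma[k])
        for i in range(len(s))
        for k in range(len(gamma))
    )

def stability_loss(p, p_cf, background):
    """Equation (23): squared change of evaluator outputs for background factors."""
    return sum((p[i] - p_cf[i][k]) ** 2 for i in range(len(p)) for k in background)

def magnitude_loss(delta):
    """Last term of Equation (24): squared L2 norm of every perturbation."""
    return sum(v * v for per_sample in delta for vec in per_sample for v in vec)

def counterfactual_objective(s, s_cf, omega, gamma, p, p_cf, background, delta,
                             eta=(1.0, 1.0, 0.1)):
    """Equation (24) with assumed weights eta_1 = eta_2 = 1.0 and eta_3 = 0.1."""
    return (eta[0] * directional_loss(s, s_cf, omega, gamma)
            + eta[1] * stability_loss(p, p_cf, background)
            + eta[2] * magnitude_loss(delta))

# One sample, two factors: index 0 is a key variable, index 1 is background.
s = [0.9]
gamma = [0.2, 0.0]
omega = [[1.0, 0.0]]
corrected = directional_loss(s, [[0.4, 0.88]], omega, gamma)
uncorrected = directional_loss(s, [[0.8, 0.88]], omega, gamma)
```

In the first call the corrected sample drops by more than the margin, so the hinge is inactive; in the second the drop is too small and a penalty of about 0.1 is incurred.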

To ensure that these latent-space counterfactuals reflect realistic operational changes, the intervention magnitude is rigorously justified and verified. Specifically, the magnitude constraint governed by the parameter

η_{3}

is dynamically calibrated based on the statistical variance of the corresponding feature distribution observed in historical normal samples. This guarantees that the synthetic perturbations remain strictly bounded within the feasible operational ranges permitted by current customs regulations. Furthermore, to validate the practical plausibility of this module, qualitative assessments were systematically conducted by domain experts, including senior customs auditors. The experts reviewed concrete operational examples mapped from the latent interventions. For instance, in a scenario where a high risk score was primarily driven by a mismatch between the declared weight and the port weighbridge record, the latent counterfactual perturbation simulated the operational correction of calibrating the declared weight. The experts confirmed that the corresponding substantial decrease in the predicted risk score accurately mirrored real-world inspection logic, where successfully resolving such a discrepancy clears the regulatory suspicion. Similarly, latent shifts corresponding to normalizing an abnormally low declared price to a standard market interval, or amending an HS code to align perfectly with the semantic commodity description, were verified by the auditors as standard, logical regulatory rectification procedures. Consequently, these expert validations demonstrate that the generated counterfactuals robustly correspond to realistic inspection scenarios, thereby providing actionable and practically sound evidence for risk tracing and making the inference-stage results strictly consistent with practical regulatory requirements.

4. Results and Discussion

4.1. Experimental Configuration

4.1.1. Hardware and Software Platform

The experiments were conducted on a high-performance hardware platform equipped with graphics processing units and multicore central processing units to support the training and inference of the proposed multimodal deep model. Specifically, the server was equipped with an NVIDIA A100 GPU based on the NVIDIA Ampere architecture, with 40 GB of video memory, which effectively supports large-scale text encoding and multimodal feature fusion. The central processing unit was an Intel Xeon Gold 6230, which provides multicore parallel computing capability for accelerating data preprocessing and batch loading. The system memory was configured as 128 GB to ensure stable operation for large-scale samples and intermediate feature caching. NVMe solid-state drives were used for storage to improve data reading and writing efficiency. The overall hardware environment satisfies the computational performance and storage bandwidth requirements of cross-border multisource heterogeneous data modeling and provides favorable stability and scalability during model training.

For the software platform, the experimental system was built on a mainstream deep learning development environment. Ubuntu 20.04 was adopted as the operating system to ensure compatibility and stability. PyTorch was selected as the core implementation framework for model construction, automatic differentiation, and training optimization. The text encoding component relied on pretrained language model interfaces provided by Hugging Face Transformers to support semantic representation learning for cross-border document texts. Graph-structured modeling was implemented using PyTorch Geometric to model enterprise relationships and trade-chain features. Data processing and analysis were conducted using NumPy and Pandas. The overall software environment supports complex procedures such as multimodal data processing, causal-constraint training, and counterfactual sample generation, thereby providing a reliable basis for the experiments.

In the experimental setting, to prevent temporal data leakage, the fixed 7:1:2 split and the cross-validation procedure were combined into a single out-of-time 5-fold strategy applied across the entire dataset, rather than used as two separate procedures. Specifically, the data sequence was divided into chronological blocks using a rolling-window approach. Within each of the 5 folds, the historical data window was partitioned into a training set and a validation set, while the block immediately following it in time served as the test set, giving an overall sample ratio of 7:1:2 for training, validation, and testing along the timeline. As a result, in every fold all test samples occurred after the training and validation periods. The results averaged over the rolling runs were taken as the final performance indicators. In addition, strict data isolation was enforced during feature engineering. Enterprise relationship features and historical declaration statistics were computed exclusively from records available before the timestamp of each target sample. No future information was exposed during the construction of entity relationship graphs or historical statistical fields, and samples from the same enterprise were ordered along the timeline so that future data could not leak into the training phase. The reported performance is therefore intended to reflect genuine predictive capability and realistic cross-region generalization.
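One possible reading of this rolling out-of-time scheme is sketched below. It assumes that samples are already sorted by timestamp, that each fold uses a window of fixed length, and that the window slides forward by one test block per fold; the function name `rolling_out_of_time_folds` and the sample count are hypothetical, and the authors' exact windowing may differ.

```python
def rolling_out_of_time_folds(n_samples, n_folds=5, parts=(7, 1, 2)):
    """Return (train, val, test) index ranges over time-sorted samples.

    Assumption: window + (n_folds - 1) * test_len == n_samples (up to
    rounding), so consecutive test blocks do not overlap.
    """
    train_p, val_p, test_p = parts
    total_p = train_p + val_p + test_p
    # Integer arithmetic avoids floating-point rounding surprises.
    window = n_samples * total_p // (total_p + (n_folds - 1) * test_p)
    n_test = window * test_p // total_p
    n_val = window * val_p // total_p
    n_train = window - n_val - n_test
    folds = []
    for k in range(n_folds):
        start = k * n_test
        train = range(start, start + n_train)
        val = range(train.stop, train.stop + n_val)
        test = range(val.stop, val.stop + n_test)
        folds.append((train, val, test))
    return folds

folds = rolling_out_of_time_folds(1800)  # hypothetical sample count
for train, val, test in folds:
    print(len(train), len(val), len(test), test.start, test.stop)
```

In every fold the test indices come strictly after the training and validation indices, which is the property the paragraph above relies on. Note that in this sketch a later fold's training window may cover an earlier fold's test block, which is usual for rolling evaluation but should be checked against the intended protocol.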

During model training, the key hyperparameters were set as follows. The maximum sequence length of the text encoder was set to 256 to balance semantic representation capability and computational cost. The overall model was trained with a batch size of 32. AdamW was selected as the optimizer, with the initial learning rate set to $\alpha = 2 \times 10^{-5}$, and a linear warm-up and decay strategy was adopted to improve training stability. The hidden dimension of the multimodal fusion layer was set to 256 to ensure effective representation of different modality features in a unified space. The weighting coefficients of the causal constraint loss and the counterfactual loss were set to $\lambda_{1} = 0.5$ and $\lambda_{2} = 0.3$, respectively, to balance classification performance and causal robustness. The number of training epochs was set between 20 and 30, and an early-stopping strategy based on validation-set performance was adopted to prevent overfitting. In addition, during counterfactual sample construction, the perturbation magnitude of key fields was controlled within a reasonable range to ensure that the generated samples were both interpretable and consistent with logic. The overall hyperparameter configuration was obtained through multiple rounds of tuning, achieving a favorable balance between performance and stability.
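The learning-rate schedule and loss weighting described above can be sketched as follows. This is a minimal illustration assuming a standard linear warm-up followed by linear decay to zero, and an additive combination of the classification, causal constraint, and counterfactual losses; the function names, the warm-up length, and the total step count are hypothetical and are not taken from the paper.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, base_lr=2e-5):
    """Learning rate at a given step: linear warm-up, then linear decay."""
    if step < warmup_steps:
        scale = step / max(1, warmup_steps)
    else:
        scale = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * scale

def total_loss(cls_loss, causal_loss, cf_loss, lambda1=0.5, lambda2=0.3):
    """Assumed additive objective: L_cls + lambda1 * L_causal + lambda2 * L_cf."""
    return cls_loss + lambda1 * causal_loss + lambda2 * cf_loss

# Hypothetical run length: 1000 optimizer steps with 100 warm-up steps.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay(step, total_steps=1000, warmup_steps=100))

print(total_loss(1.0, 0.2, 0.1))  # placeholder loss values
```

In a PyTorch implementation the same schedule would typically be attached to the AdamW optimizer through a scheduler rather than computed by hand.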

4.1.2. Baseline Models and Evaluation Metrics

In the comparative experiments, several representative models were selected for performance comparison, including traditional machine learning models such as Logistic Regression [36], Random Forest [37], and XGBoost [38]; deep learning models using only structured features, such as MLP [39] and TabTransformer [40]; models using only the textual modality, such as TextCNN [41], BiLSTM [42], and BERT [43]; conventional multimodal fusion models, such as BERT+MLP [43] and Multimodal Transformer [9]; and recent state-of-the-art advanced frameworks, including the graph neural network model MSDG [44], the multimodal generative framework MFGAN [45], the causality-aware domain generalization method CSRDN [46], and the multi-modal domain generalization approach MDGAR [47]. Logistic Regression is a classical classification model based on a linear assumption, in which the relationship between features and labels is characterized by parameters $\beta$. Its advantage lies in its simple structure and favorable interpretability. Random Forest performs prediction by constructing multiple decision trees and conducting ensemble learning, thereby effectively reducing the risk of overfitting, with strong capability in modeling nonlinear relationships. XGBoost is an ensemble model based on the gradient boosting framework, in which high-accuracy prediction is achieved through iterative optimization of the loss function $L$; it is characterized by stable performance and excellent results on structured data. MLP, as a multilayer feedforward neural network, learns complex feature representations through multilayer nonlinear mappings and is advantageous due to its flexible structure and ease of implementation. TabTransformer introduces a self-attention mechanism to model tabular data, allowing relationships among features to be adaptively learned through attention weights $\alpha$, thereby capturing feature interactions. TextCNN extracts local semantic features through convolutional operations and is suitable for short-text modeling, with high computational efficiency and sensitivity to local patterns. BiLSTM captures contextual dependencies through bidirectional sequence modeling and can learn sequential features, showing strong capability in contextual semantic modeling. BERT is pretrained based on the Transformer architecture and obtains deep semantic representations through context-aware encoding, providing strong language understanding ability. BERT+MLP is a simple multimodal fusion method in which textual representations and structured features are concatenated before being fed into a classifier, with the advantages of simple implementation and stable fusion performance. Multimodal Transformer realizes information interaction among different modalities through cross-modal attention mechanisms, dynamically fusing representations through attention weights $\alpha$ and effectively modeling complex associations among multiple modalities. MSDG captures complex topological associations among entities through a multi-scale dynamic graph neural network. MFGAN utilizes an attention-based autoencoder and a generative adversarial network to achieve robust multimodal fusion. CSRDN leverages domain intervention from a causality-aware perspective to learn invariant features for domain generalization. MDGAR incorporates a multi-modal approach specifically targeting domain generalization to enhance robustness across different environments.

For model evaluation, Accuracy was used to measure the overall classification correctness, Precision was used to characterize the proportion of truly positive samples among samples predicted as positive, Recall was used to measure the ability to identify truly positive samples, F1-score was used to balance Precision and Recall, AUC was used to evaluate the overall ranking ability of the model, PR-AUC was used to emphasize positive-class identification under class imbalance, Top-K Hit Rate was used to measure whether high-risk samples appeared among the top K predicted results, and Early Warning Gain was used to evaluate the warning benefit of the model under limited screening resources.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{27}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{28}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{29}$$

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{30}$$

$$\mathrm{AUC} = \int_{0}^{1} TPR\left(FPR^{-1}(x)\right)\,dx, \tag{31}$$

$$\text{PR-AUC} = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall})\,d\,\mathrm{Recall}, \tag{32}$$

$$\text{Top-}K\text{ Hit Rate} = \frac{\text{Number of positives in Top-}K}{\text{Total positives}}, \tag{33}$$

$$\text{Early Warning Gain} = \frac{\text{Captured risk in Top-}K}{\text{Total risk}}. \tag{34}$$

In the above equations, $TP$ denotes true positives, $TN$ denotes true negatives, $FP$ denotes false positives, and $FN$ denotes false negatives. $TPR$ denotes the true positive rate, namely Recall, $FPR$ denotes the false positive rate, and $K$ denotes the sample-number threshold for screening. “Captured risk” denotes the number of risk samples captured among the top $K$ samples, while “Total risk” denotes the total number of risk samples in the dataset.
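The threshold-based metrics in Equations (27) to (30) and the Top-K Hit Rate in Equation (33) can be computed as in the following minimal sketch. The helper names and the toy labels, predictions, and scores are hypothetical and serve only to make the definitions concrete.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, Precision, Recall, and F1 from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def top_k_hit_rate(y_true, scores, k):
    """Share of all positives that appear among the k highest-scored samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(y_true[i] for i in order[:k])
    total_pos = sum(y_true)
    return hits / total_pos if total_pos else 0.0

# Toy example (not data from the paper).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]

metrics = classification_metrics(y_true, y_pred)
hit_rate = top_k_hit_rate(y_true, scores, k=3)
print(metrics, hit_rate)
```

AUC and PR-AUC require integrating over all thresholds and are usually taken from an established library rather than reimplemented.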

4.2. Performance Comparison with Baseline Models

This experiment was designed to evaluate the overall effectiveness of the proposed method in cross-border logistics anomaly monitoring and to systematically compare it with traditional machine learning methods, single-modality deep learning models, and conventional multimodal fusion models. To ensure the statistical reliability of our findings, all experimental results are reported as the mean ± standard deviation across the 5-fold cross-validation. Furthermore, we conducted a paired t-test to assess the statistical significance of the performance improvements. The specific p-values comparing the proposed method against the best-performing baseline for each metric are explicitly reported in the bottom row of the table, confirming that the improvements are statistically significant (e.g., $p = 0.014$ for Accuracy and $p = 0.011$ for F1-score).
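For reference, the paired t-statistic over per-fold scores can be computed as in the sketch below. The fold scores shown are placeholder values chosen for easy arithmetic and are not results from the paper; converting the statistic to a p-value requires the Student t distribution with $n - 1$ degrees of freedom, which a statistics library would normally supply.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1

# Placeholder per-fold scores for two models (illustrative only).
model_a = [2.0, 4.0, 6.0]
model_b = [1.0, 2.0, 3.0]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(t_stat, dof)
```

With five folds the test has only four degrees of freedom, so the per-fold differences need to be consistent for the result to reach significance.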

As shown in Table 2 and Figure 5, Logistic Regression achieved Accuracy, F1-score, and AUC values of 0.821, 0.646, and 0.842, respectively, showing the weakest overall performance. This indicates that a linear model is insufficient for characterizing the complex nonlinear risk relationships in cross-border data. Random Forest and XGBoost increased the F1-score to 0.684 and 0.707, respectively, suggesting that ensemble tree models are better able to handle nonlinear combinations in structured declaration fields. However, their utilization of textual semantics, logistics states, and entity relationships remains limited. The F1-score of MLP was 0.691, which was slightly higher than that of Random Forest but lower than that of XGBoost. This indicates that multilayer nonlinear mapping provides a certain level of representation capability, but heterogeneous signals cannot be sufficiently modeled without an explicit feature interaction mechanism. TabTransformer achieved an F1-score and AUC of 0.728 and 0.903, respectively, outperforming conventional structured-data models. This demonstrates that the self-attention mechanism can more effectively capture dependencies among fields such as price, weight, and HS code.

From the results of textual and multimodal models, the F1-scores of TextCNN and BiLSTM were 0.669 and 0.677, respectively. These models can extract local semantic patterns or contextual dependencies, but they remain insufficient for modeling complex long-range semantics and multisource signal interactions. The F1-score of BERT increased to 0.722, indicating that pretrained language models can better understand deep semantics in commodity descriptions, bill-of-lading notes, and contract summaries. BERT+MLP achieved an F1-score of 0.758, showing that simple concatenation of textual semantics and structured fields can already produce clear performance gains. Multimodal Transformer further achieved an Accuracy of 0.903 and an F1-score of 0.783, indicating that cross-modal attention can dynamically model dependencies among different modalities. To further validate against recent state-of-the-art advances, the newly added baselines were evaluated. MSDG achieved an F1-score of 0.793 and an AUC of 0.941 by leveraging graph structures to model complex entity relationships. MFGAN reached an F1-score of 0.801, demonstrating the effectiveness of generative adversarial networks in multimodal feature fusion. Furthermore, the domain-generalization methods CSRDN and MDGAR achieved F1-scores of 0.806 and 0.813, respectively. Their strong performance, along with high AUC values of 0.948 and 0.951, highlights that incorporating causal constraints and domain-invariant learning significantly improves model robustness. Despite the highly competitive performance of these advanced frameworks, the proposed method achieved the best results across all metrics, with Accuracy, Precision, Recall, F1-score, AUC, and PR-AUC reaching 0.927, 0.842, 0.811, 0.826, 0.958, and 0.817, respectively. This improvement is mainly attributed to the fact that multisource sensing signal fusion is combined with causal debiasing to weaken the influence of pseudo-correlated factors such as region, transportation mode, and textual style, while counterfactual responses are used to enhance model sensitivity to true anomaly-driving factors. Therefore, compared with models that only learn statistical correlations, the proposed method can obtain a more stable, explainable, and cross-scenario-adaptive risk representation.

4.3. Generalization and Robustness Evaluation

This experiment was designed to evaluate the generalization ability and robustness of the proposed method in realistic cross-border supervision environments, rather than merely assessing static classification performance under a random test split. All results in these evaluation scenarios are presented as the mean ± standard deviation to accurately reflect the model variance across different folds and perturbation settings.

As shown in Table 3 and Figure 6, the random test split represents the standard testing scenario, under which the model achieved an Accuracy of 0.927, an F1-score of 0.826, and an AUC of 0.958. This indicates that when the training and test distributions are relatively consistent, the proposed method can sufficiently exploit multisource sensing signals to complete anomaly identification. In cross-time testing, Accuracy and F1-score decreased to 0.913 and 0.804, respectively, suggesting that model performance is affected when commodity prices, transportation routes, and anomaly strategies change over time. However, the performance degradation remained limited, indicating that the causal debiasing mechanism can alleviate pseudo-correlation interference caused by temporal drift. The F1-scores of cross-region and cross-port testing were 0.793 and 0.786, respectively, showing that differences in commodity structures, regulatory procedures, logistics nodes, and declaration habits across regions and ports lead to more obvious distribution shifts. Nevertheless, the model maintained a high AUC, demonstrating stable risk-ranking capability. In cross-entity testing, the F1-score decreased to 0.777, which represents a relatively large drop among the generalization scenarios. This suggests that historical behavior patterns, transaction relationships, and declaration habits of unseen enterprise entities vary substantially, imposing higher requirements on the model.

From the robustness results, the model still achieved an Accuracy of 0.909 and an F1-score of 0.798 under text noise perturbation, which was quantitatively defined by randomly replacing or masking 15% of the input tokens. This indicates that the pretrained language encoder and cross-modal fusion mechanism can resist OCR errors, abbreviation confusion, and nonstandard commodity descriptions to a certain extent. Under missing modality simulation, where structured or relational features were randomly dropped for 20% of the samples, the F1-score was 0.781, showing that model performance declines when partial document, logistics, or relationship signals are unavailable. However, the gated fusion structure can dynamically adjust information contributions according to the available modalities, preventing severe degradation. Under logistics trajectory perturbation, generated by adding Gaussian noise with a standard deviation of 0.01 degrees to the longitude and latitude coordinates, and sensor record missing scenarios, where 15% of hardware records were randomly set to null, the F1-scores were 0.791 and 0.784, respectively. These results indicate that hardware sensing signals such as GPS, AIS, weighing, and electronic seals make important contributions to risk identification, while the model does not completely rely on a single sensing source. Theoretically, the stability of the proposed method comes from three aspects. Cross-modal attention establishes complementary relationships among different signals and prevents noise from a single input source from dominating the decision. Causal representation learning suppresses unstable correlations in regional, port, and entity backgrounds, making the risk representation closer to the anomaly mechanisms shared across scenarios. Counterfactual response constraints enhance model sensitivity to changes in key risk variables. Therefore, even when time, region, entity, and sensing-signal quality change, the model can still maintain strong risk-ranking and priority-screening capabilities.
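The perturbation settings described above can be reproduced in spirit with the following sketch. It assumes token-level masking only (the paper mentions replacing or masking), independent Gaussian jitter on each coordinate, and uniform random selection of records to null; the function names, the `[MASK]` placeholder, and the example inputs are hypothetical.

```python
import random

def mask_tokens(tokens, rate=0.15, mask="[MASK]", seed=0):
    """Mask a fixed share of tokens chosen uniformly at random."""
    rng = random.Random(seed)
    n_mask = round(len(tokens) * rate)
    chosen = set(rng.sample(range(len(tokens)), n_mask))
    return [mask if i in chosen else tok for i, tok in enumerate(tokens)]

def jitter_track(track, sigma_deg=0.01, seed=0):
    """Add Gaussian noise (in degrees) to each (lon, lat) point."""
    rng = random.Random(seed)
    return [(lon + rng.gauss(0.0, sigma_deg), lat + rng.gauss(0.0, sigma_deg))
            for lon, lat in track]

def drop_records(records, rate=0.15, seed=0):
    """Set a fixed share of sensor records to None."""
    rng = random.Random(seed)
    n_drop = round(len(records) * rate)
    chosen = set(rng.sample(range(len(records)), n_drop))
    return [None if i in chosen else rec for i, rec in enumerate(records)]

# Hypothetical inputs for illustration.
tokens = [f"tok{i}" for i in range(20)]
track = [(121.50 + 0.01 * i, 31.20 + 0.01 * i) for i in range(5)]
records = [{"weight_kg": 1000 + i} for i in range(20)]

noisy_tokens = mask_tokens(tokens)
noisy_track = jitter_track(track)
sparse_records = drop_records(records)
print(noisy_tokens.count("[MASK]"), sparse_records.count(None))
```

Fixing the seed keeps the perturbed test sets identical across models, which makes the robustness comparison fair.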

4.4. Ablation Study

This ablation experiment was designed to verify the independent contribution of each component in the proposed multisource sensing fusion and causally explainable framework, and to analyze the effects of different modalities, alignment mechanisms, causal debiasing, counterfactual response, and engineering constraints on final anomaly identification performance. Similar to the baseline comparisons, the results are presented as the mean ± standard deviation. Paired t-tests were conducted to compare the full model against each ablation variant. The specific p-values, demonstrating the statistical significance of the performance drop when removing a core module compared to the best-performing variant, are explicitly added to the bottom row of the table.

As shown in Table 4 and Figure 7, the full model achieved the best results across all metrics, with Accuracy, Precision, Recall, F1-score, AUC, and PR-AUC reaching 0.927, 0.842, 0.811, 0.826, 0.958, and 0.817, respectively. This indicates that multisource signal fusion, causal constraints, and counterfactual explanation form an effective complementary relationship. After the document semantic modality was removed, the F1-score decreased to 0.783, indicating that semantic information contained in commodity descriptions, bill-of-lading notes, and contract summaries is important for identifying ambiguous commodity names, inconsistent descriptions, and implicit anomalies. After the structured declaration modality was removed, the F1-score further dropped to 0.771, which was the largest decrease among the single-modality removal experiments. This suggests that fields such as price, weight, quantity, and HS code are fundamental discriminative signals for cross-border anomaly monitoring. After the entity relationship modality was removed, the F1-score was 0.791. Although the decline was smaller than those caused by removing the first two modalities, the result was still clearly lower than that of the full model, indicating that enterprise historical cooperation, upstream and downstream associations, and risk co-occurrence information can supplement group-level anomaly patterns that are difficult to capture from a single declaration sample. When only document and structured fields were used, the F1-score decreased to 0.758, and when only structured fields were used, it further decreased to 0.728, demonstrating that single-modality or simple dual-modality inputs cannot sufficiently cover the multisource risk mechanisms in real cross-border scenarios.

From the perspective of model structure, removing cross-modal attention reduced the F1-score to 0.777, indicating that simple concatenation cannot effectively establish fine-grained correspondences among document semantics, declaration fields, and entity relationships, such as the implicit consistency among commodity descriptions and HS codes, declared weights and commodity attributes, and logistics routes. Removing the contrastive alignment loss resulted in an F1-score of 0.797, showing that cross-modal contrastive learning can enhance latent-space consistency among different sensing signals from the same sample and reduce semantic fragmentation across modalities. After the causal debiasing module was removed, the F1-score decreased to 0.791, suggesting that if only statistical correlations are learned, the model is more likely to be affected by pseudo-correlated factors such as regions, ports, transportation modes, and textual styles, resulting in insufficiently stable risk representations. Removing the counterfactual response module led to an F1-score of 0.803, indicating that counterfactual perturbations can enhance model sensitivity to key intervenable factors and guide the model to focus more on variables that truly drive risk changes. After engineering constraints were removed, the F1-score decreased to 0.786, demonstrating that business constraints such as consistency between commodity descriptions and HS codes, matching between declared and inspected weights, and reasonableness of logistics paths provide more stable structural priors for the model. Overall, the full model achieves the best performance because its mathematical modeling process does not merely expand the feature dimensionality. Instead, cross-modal attention is used to learn dependencies among heterogeneous signals, contrastive constraints are used to reduce representation distances among signals from the same sample, causal debiasing is used to reduce the interference of environmental variables in risk representations, and counterfactual responses are used to strengthen decision boundaries along key risk directions. Therefore, simultaneous advantages are achieved in accuracy, recall capability, and imbalanced-risk identification.
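The F1-score drops implied by the values quoted in this subsection can be tabulated with a few lines of arithmetic. The sketch below uses only the F1-scores stated in the text (full model 0.826 and each ablation variant); the variant labels are shorthand introduced here for readability.

```python
FULL_F1 = 0.826

# F1-scores of the ablation variants as quoted in this subsection.
ablation_f1 = {
    "w/o document semantics": 0.783,
    "w/o structured declaration": 0.771,
    "w/o entity relationship": 0.791,
    "document + structured only": 0.758,
    "structured only": 0.728,
    "w/o cross-modal attention": 0.777,
    "w/o contrastive alignment": 0.797,
    "w/o causal debiasing": 0.791,
    "w/o counterfactual response": 0.803,
    "w/o engineering constraints": 0.786,
}

# Absolute F1 drop relative to the full model, rounded to three decimals.
drops = {name: round(FULL_F1 - f1, 3) for name, f1 in ablation_f1.items()}

for name, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {drop:.3f}")
```

Sorted this way, the reduced-input variants show the largest drops, followed by removing the structured declaration modality and cross-modal attention, while removing the counterfactual response module costs the least.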

4.5. Hyperparameter Sensitivity Analysis

To evaluate the robustness of the chosen configuration and to systematically investigate the impact of key hyperparameters on model performance, a comprehensive hyperparameter sensitivity analysis was conducted. This experiment aims to justify the selected values by observing the fluctuations in the final prediction metrics when individual hyperparameters are varied within a reasonable range. The analysis focuses on four core parameters: batch size, learning rate $\alpha$, causal constraint loss weight $\lambda_{1}$, and counterfactual loss weight $\lambda_{2}$. During the evaluation, one target hyperparameter was adjusted while the others were kept at their optimal baseline settings. The F1-score and AUC metrics derived from the 5-fold cross-validation were recorded to measure the model sensitivity.

As shown in Table 5, the proposed framework demonstrates a high degree of robustness across different hyperparameter configurations, with F1-score and AUC values remaining relatively stable within the tested ranges. For the batch size, performance peaks at 32, balancing gradient stability and adequate noise for generalization, whereas a larger batch size of 64 slightly degrades the F1-score to 0.819. Regarding the learning rate, the model is somewhat sensitive; an optimal value of $2 \times 10^{-5}$ effectively prevents underfitting observed at $1 \times 10^{-5}$ and optimization divergence observed at $5 \times 10^{-5}$. The loss weights $\lambda_{1}$ and $\lambda_{2}$ control the contribution of the causal debiasing and counterfactual response modules, respectively. When $\lambda_{1}$ is set to 0.5 and $\lambda_{2}$ to 0.3, the model achieves the best balance between accurate anomaly classification and robust causal representation. Deviating from these optimal values results in only minor performance drops, indicating that the framework does not excessively rely on rigorous hyperparameter tuning to achieve state-of-the-art capability. These results systematically justify the parameter choices reported in the experimental setup.
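The one-at-a-time procedure used in this analysis can be sketched as a small configuration generator. The baseline values and the batch-size and learning-rate candidates below are those named in the text; the function name `one_at_a_time` is hypothetical, and the grids for $\lambda_{1}$ and $\lambda_{2}$ are omitted because their tested values are not listed in this section.

```python
BASELINE = {"batch_size": 32, "lr": 2e-5, "lambda1": 0.5, "lambda2": 0.3}

# Candidate values mentioned in the text; lambda grids would be added the
# same way once their tested values are known.
GRID = {
    "batch_size": [32, 64],
    "lr": [1e-5, 2e-5, 5e-5],
}

def one_at_a_time(baseline, grid):
    """Return (varied_name, config) pairs, changing one hyperparameter at a time."""
    configs = [("baseline", dict(baseline))]
    for name, values in grid.items():
        for value in values:
            if value == baseline[name]:
                continue  # already covered by the baseline run
            cfg = dict(baseline)
            cfg[name] = value
            configs.append((name, cfg))
    return configs

configs = one_at_a_time(BASELINE, GRID)
for varied, cfg in configs:
    print(varied, cfg)
```

Each generated configuration would then be trained and scored with the same 5-fold protocol, so that any change in F1-score or AUC can be attributed to the single varied parameter.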

4.6. Discussion

The results indicate that cross-border logistics and financial anomaly monitoring cannot rely solely on a single declaration field or document text. Instead, the entire cargo and trade-finance circulation process should be considered, and multisource information such as electronic documents, declaration fields, logistics trajectories, port inspections, cold-chain environments, and entity relationships should be comprehensively utilized [48]. In practical port supervision scenarios, a batch of goods may contain a seemingly standardized commodity name in the declaration form, while subtle inconsistencies exist between the HS code and the commodity description. At the same time, the declared price may be lower than the historical interval of similar commodities, and the logistics trajectory may show abnormal transshipment [49]. If only manual rules or a single structured model are used, such anomalies may be fragmented into several insignificant local signals [50]. In contrast, the proposed method jointly models document semantics, declared prices, transportation routes, and entity relationships, making it easier to identify financial risks hidden in the coupling relationships among multisource signals. Especially in scenarios such as port container clearance, batch review of cross-border e-commerce parcels, agricultural product cold-chain export, and bonded warehouse entry and exit supervision, this method can provide regulators with more targeted risk-ranking results, enabling limited inspection resources to be prioritized toward more suspicious cargo batches.

From the perspective of practical value, the proposed method focuses not only on whether an anomaly exists, but also on where the anomaly may originate. For example, in agricultural product cold-chain export scenarios, if the declaration information for a batch of fruit or aquatic products is relatively complete, but temperature and humidity sensor records indicate a prolonged temperature abnormality during transportation and electronic seal records indicate unplanned container opening, the model can jointly incorporate environmental-state anomalies and logistics-node anomalies into the decision, thereby providing evidence for cargo-quality risk assessment, document-consistency verification, and financial compliance review [51]. In port clearance scenarios, if the weighbridge result deviates substantially from the declared weight, the X-ray image indicates inconsistency between loading density and the declared commodity type, and the enterprise historical transaction relationship shows an abnormally high-frequency cooperation chain, the model can jointly explain weight anomalies, inspection-image anomalies, and entity-relationship anomalies as sources of high risk [52]. This explanation pattern is closer to the practical working logic of regulators, because port inspection usually does not rely on a single indicator for direct determination, but instead requires a comprehensive judgment based on documents, goods, routes, device records, and enterprise backgrounds.

In addition, the cross-time, cross-region, cross-port, and cross-entity testing results demonstrate that the proposed method maintains relatively stable performance under changing environments, which is important for real deployment. Commodity prices, transportation routes, enterprise cooperation relationships, and declaration habits in cross-border trade continuously change with seasons, markets, and policies. If a model only learns superficial correlations in the training data, it may easily fail when applied to new ports, new enterprises, or new routes. Through causal debiasing and engineering constraints, the proposed method focuses more on stable rules, such as consistency between commodity descriptions and HS codes, rationality of declared prices, matching between weight and inspection records, continuity of logistics paths, and abnormal coupling of entity relationships. Therefore, it is more suitable for continuously changing regulatory and financial security environments. For regulatory departments, this method can be embedded into existing workflows as an intelligent auxiliary screening tool, providing risk priorities, anomaly-source prompts, and key inspection suggestions before manual review, thereby improving port clearance efficiency and financial anomaly identification accuracy.

4.7. Limitations and Future Work

Although a multisource sensing signal fusion framework for cross-border logistics anomaly monitoring was constructed and its effectiveness was verified under multiple experimental scenarios, certain limitations remain. First, although the data used in this study cover multiple types of information, including electronic documents, structured declaration fields, logistics trajectories, port inspection records, and cold-chain environmental states, data sources in real cross-border scenarios are more complex. Data standards are not fully consistent across different ports, enterprises, and logistics platforms, and some sensor records may suffer from missing values, delays, or inconsistent acquisition frequencies. Second, although causal debiasing and counterfactual response mechanisms were introduced, counterfactual samples were mainly generated in the latent representation space. Therefore, the reliability of the explanation results still needs to be further validated using regulatory expert knowledge and real intervention cases. In addition, hardware sensing data such as X-ray images, video records, and electronic seals have different coverage levels across scenarios, and model stability under extreme modality-missing conditions still needs to be further improved. Future work will be advanced in three directions. First, data sources will be further expanded by incorporating data from more ports, comprehensive bonded zones, cross-border e-commerce platforms, and cold-chain logistics enterprises, so as to improve the adaptability of the model to different scenarios. Second, fine-grained modeling of multisource hardware sensing data will be strengthened. For example, X-ray image features, video temporal behaviors, electronic seal opening and closing sequences, and continuous changes in cold-chain environments will be further fused, enabling the model to more accurately perceive cargo physical states and logistics process anomalies. 
Third, lightweight deployment schemes for practical regulatory systems will be explored to reduce inference latency and computational resource consumption. Expert feedback mechanisms will also be incorporated to continuously calibrate model explanation results, thereby promoting the practical application of the proposed method in intelligent port supervision, cross-border logistics security, and supply chain risk early warning.

5. Conclusions

To meet the requirements of multisource heterogeneous data perception and financial anomaly risk identification in cross-border circulation processes, a multisource sensing signal fusion and causally explainable risk identification framework is proposed. In this framework, electronic documents, structured declaration fields, GPS/AIS logistics trajectories, port weighing records, RFID data, electronic seal status, X-ray inspection images, cold-chain temperature and humidity records, and transportation vibration records are uniformly modeled as multisource sensing signals in cross-border trade scenarios. Accordingly, the traditional document and financial compliance review problem is extended into an engineering-oriented anomaly monitoring task for intelligent port supervision. In terms of methodological design, a multisource cross-border sensing signal representation and cross-modal alignment module is constructed to model the interactions among textual semantics, attribute fields, logistics status, device records, and entity relationships. Furthermore, an engineering-constraint-guided causal risk representation mechanism is introduced to reduce the interference of spurious correlated factors, such as regions, ports, transportation modes, and textual styles, in model judgments. Meanwhile, a counterfactual anomaly response module is constructed to enhance model sensitivity and interpretability with respect to true anomaly-driving factors through intervention analysis of key variables.

Experimental results show that the proposed method achieves the best overall performance in the cross-border financial anomaly detection task, with Accuracy, Precision, Recall, F1-score, AUC, and PR-AUC reaching 0.927, 0.842, 0.811, 0.826, 0.958, and 0.817, respectively, clearly outperforming baseline models including Logistic Regression, Random Forest, XGBoost, BERT, BERT+MLP, and Multimodal Transformer. In cross-time, cross-region, cross-port, and cross-entity testing scenarios, high F1-score and AUC values are still maintained, indicating favorable cross-scenario generalization capability. Under complex conditions such as text noise, missing modalities, logistics trajectory perturbations, and missing sensing records, only limited performance degradation is observed, verifying the robustness of the proposed method against incomplete sensing data and noise interference in real-world scenarios. Ablation experiments further demonstrate that cross-modal attention, contrastive alignment, causal debiasing, counterfactual response, and engineering constraints all make important contributions to performance improvement. Overall, a feasible and explainable technical pathway is provided for AI-driven multisource sensing data fusion, intelligent port anomaly monitoring, and supply-chain financial risk early warning in cross-border scenarios.

Despite the promising results, this study has several limitations that should be acknowledged and addressed in future work. First, the proposed framework relies heavily on supervised learning using historical labeled anomalies, which intrinsically constrains its ability to discover entirely novel, unknown trade-finance fraud patterns without prior annotation. Second, while the constructed dataset is extensive, it remains constrained to specific regions and cooperative enterprises, and the lack of evaluation against standardized external benchmarks limits the broader comparative assessment of the framework. Furthermore, the current evaluations are primarily conducted in an offline environment, meaning the absence of real-world online testing and deployment-level validation leaves potential challenges related to real-time processing latency and continuous model updating unexplored. Future research will focus on integrating unsupervised or semi-supervised anomaly discovery mechanisms, evaluating the framework on broader open-source benchmarks, and deploying the system in live port supervision environments to validate its real-time operational reliability.

Author Contributions

Conceptualization, J.Y., Z.L., B.X. and M.L.; Methodology, J.Y., Z.L. and B.X.; Software, J.Y., Z.L. and B.X.; Validation, Y.L.; Formal analysis, Y.L.; Investigation, Y.L.; Resources, K.S. and R.L.; Data curation, K.S. and R.L.; Writing – original draft, J.Y., Z.L., B.X., K.S., R.L., Y.L. and M.L.; Visualization, K.S. and R.L.; Supervision, M.L.; Project administration, M.L.; Funding acquisition, M.L.; J.Y., Z.L. and B.X. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61202479.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Fabozzi, F.J.; Markowitz, H.M.; Gupta, F. Portfolio selection. Handb. Financ. 2008, 2, 3–13.
Sharpe, W.F. Capital asset prices: A theory of market equilibrium under conditions of risk. J. Financ. 1964, 19, 425–442.
Chen, A.; Wei, Y.; Le, H.; Zhang, Y. Learning by teaching with ChatGPT: The effect of teachable ChatGPT agent on programming education. Br. J. Educ. Technol. 2024, 57, 163–184.
Jorion, P. Value at Risk: The New Benchmark for Managing Financial Risk; McGraw-Hill: New York, NY, USA, 1997; Volume 2.
Rockafellar, R.T.; Uryasev, S. Optimization of conditional value-at-risk. J. Risk 2000, 2, 21–42.
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: New York, NY, USA, 2020; pp. 1597–1607.
Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent: A new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
Dangsawang, B.; Nuchitprasitchai, S. A machine learning approach for detecting customs fraud through unstructured data analysis in social media. Decis. Anal. J. 2024, 10, 100408.
He, D. A multimodal deep neural network-based financial fraud detection model via collaborative awareness of semantic analysis and behavioral modeling. J. Circuits Syst. Comput. 2025, 34, 2550054.
Kim, D. AI-Driven Cross-Modal Feature Alignment for Multimodal Fraud Detection in Financial Transaction Systems. J. Glob. Eng. Rev. 2026, 4, 21–37.
Bello, H.O.; Ige, A.B.; Ameyaw, M.N. Deep learning in high-frequency trading: Conceptual challenges and solutions for real-time fraud detection. World J. Adv. Eng. Technol. Sci. 2024, 12, 35–46.
Chowdhury, R.; Talukder, K.A. Assessing The Role of Statistical Modeling Techniques in Fraud Detection Across Procurement And International Trade Systems. Am. J. Interdiscip. Stud. 2022, 3, 91–125.
Ma, S.; Xie, W.; Li, D.; Li, H.; Li, Y. Reducing spurious correlation for federated domain generalization. arXiv 2024, arXiv:2407.19174.
Will, I. Explainability vs. Performance Trade-offs: Governance Challenges in Deploying Black-Box Models for Generative Fraud Detection. 2024. Available online: https://www.researchgate.net/profile/Isreal-Will/publication/403447005_Explainability_vs_Performance_Trade-offs_Governance_Challenges_in_Deploying_Black-Box_Models_for_Generative_Fraud_Detection/links/69ce686b60c0371a60f56dc3/Explainability-vs-Performance-Trade-offs-Governance-Challenges-in-Deploying-Black-Box-Models-for-Generative-Fraud-Detection.pdf (accessed on 1 June 2024).
Mishra, L.N.; Senapati, B. FraudFusion: A Multi-Modal Ensemble Framework for Anomaly-Based Fraud Detection in Retail POS Systems. IEEE Access 2026, 14, 10436–10456.
Zitis, P.I.; Potirakis, S.M.; Alexandridis, A. Forecasting forex market volatility using deep learning models and complexity measures. J. Risk Financ. Manag. 2024, 17, 557.
Huang, W.R.; Perez, M.A. Accurate, data-efficient learning from noisy, choice-based labels for inherent risk scoring. arXiv 2018, arXiv:1811.10791.
Gu, S.; Kelly, B.; Xiu, D. Empirical asset pricing via machine learning. Rev. Financ. Stud. 2020, 33, 2223–2273.
Chen, L.; Pelger, M.; Zhu, J. Deep learning in asset pricing. Manag. Sci. 2024, 70, 714–750.
Bucci, A. Realized volatility forecasting with neural networks. J. Financ. Econom. 2020, 18, 502–531.
Zhou, X.; Chen, S.; Ren, Y.; Zhang, Y.; Fu, J.; Fan, D.; Lin, J.; Wang, Q. Atrous Pyramid GAN Segmentation Network for Fish Images with High Performance. Electronics 2022, 11, 911.
Wang, C.; Chen, Y.; Zhang, S.; Zhang, Q. Stock market index prediction using deep Transformer model. Expert Syst. Appl. 2022, 208, 118128.
Yañez, C.; Kristjanpoller, W.; Minutolo, M.C. Stock market index prediction using transformer neural network models and frequency decomposition. Neural Comput. Appl. 2024, 36, 15777–15797.
Ruiru, D.; Jouandeau, N.; Owuor, D. Recurrent Neural Networks with Transformers to Trade Financial Instruments. SN Comput. Sci. 2025, 6, 934.
Sezer, O.B.; Gudelek, M.U.; Ozbayoglu, A.M. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Appl. Soft Comput. 2020, 90, 106181.
Olorunnimbe, K.; Viktor, H. Deep learning in the stock market—A systematic survey of practice, backtesting, and applications. Artif. Intell. Rev. 2023, 56, 2057–2109.
Ma, Y.; Ventre, C.; Polukarov, M. Denoised labels for financial time series data via self-supervised learning. In Proceedings of the Third ACM International Conference on AI in Finance, New York, NY, USA, 2–4 November 2022; pp. 471–479.
Song, C.H.; Liu, S.; Chen, L. The Label Horizon Paradox: Rethinking Supervision Targets in Financial Forecasting. arXiv 2026, arXiv:2602.03395.
Tobing, V.; Hamzah, M.; Chew, J.; Hamdan, M. Towards Digital Twin Integrated Border Control. In Proceedings of the 2024 IEEE International Conference on Computing (ICOCO), Kuala Lumpur, Malaysia, 12–14 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 114–121.
Oord, A.V.D.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
Sun, J.; Qing, Y.; Liu, C.; Lin, J. Self-fts: A self-supervised learning method for financial time series representation in stock intraday trading. In Proceedings of the 2022 IEEE 20th International Conference on Industrial Informatics (INDIN), Perth, Australia, 25–28 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 501–506.
Hwang, Y.; Zohren, S.; Lee, Y. Temporal Representation Learning for Stock Similarities and Its Applications in Investment Management. arXiv 2024, arXiv:2407.13751.
Abdulsahib, H.M.; Ghaderi, F. Cross-domain disentanglement: A novel approach to financial market prediction. IEEE Access 2024, 12, 16255–16265.
Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
Huang, X.; Khetan, A.; Cvitkovic, M.; Karnin, Z. TabTransformer: Tabular data modeling using contextual embeddings. arXiv 2020, arXiv:2012.06678.
Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751.
Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 260–270.
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
Zhao, Z.; Xiao, Z.; Tao, J. MSDG: Multi-scale dynamic graph neural network for industrial time series anomaly detection. Sensors 2024, 24, 7218.
Qu, X.; Liu, Z.; Wu, C.Q.; Hou, A.; Yin, X.; Chen, Z. MFGAN: Multimodal fusion for industrial anomaly detection using attention-based autoencoder and generative adversarial network. Sensors 2024, 24, 637.
Shao, Y.; Wang, S.; Zhao, W. A causality-aware perspective on domain generalization via domain intervention. Electronics 2024, 13, 1891.
Papadakis, A.; Spyrou, E. A multi-modal egocentric activity recognition approach towards video domain generalization. Sensors 2024, 24, 2491.
Chen, H. Optimization of Real-time Logistics Tracking and Anomaly Warning System for Cross-border E-commerce Based on Edge Computing and Long Short-Term Memory (LSTM). In Proceedings of the 2025 5th International Conference on Computer, Internet of Things and Control Engineering (CITCE), Guangzhou, China, 19–21 December 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 147–153.
Xi, Y.; Jia, X.; Zhang, H. Real-time multimodal route optimization and anomaly detection for cross-border logistics using deep reinforcement learning. Acad. Nexus J. 2024, 3, 133–142.
Shi, J. An Intelligent Logistics Scheduling System Based on Big Data and IoT for Cross-Border Trade: A Case Study of China-Armenia Trade Corridor. In Proceedings of the 2026 3rd International Conference on Autonomous Driving and Intelligent Sensing Technology, Wuhan, China, 6–8 February 2026; pp. 226–230.
Ceylan, B.O.; Elidolu, G.; Sezer, S.I.; Akyuz, E.; Yang, Z. Probabilistic risk assessment for inert gas system on oil tanker ships using system theoretic accident model and process (STAMP) and Bayesian belief network (BBN). Reliab. Eng. Syst. Saf. 2025, 266, 111669.
Wu, X.; Feng, S.; Li, J.; Du, Z.; Tang, R.; Wang, J.; Leng, P. Financial Disclosure Identification and Risk Assessment Model for Listed Companies Based on BERT and Graph Neural Networks. In Proceedings of the 2025 International Conference on Economic Management and Big Data Application, Shenzhen, China, 25–27 July 2025; pp. 383–388.

Figure 1. Geographical location of Tianjin Port.

Figure 2. The multisource sensing representation and cross-modal alignment module fuses document texts, declaration fields, and entity relationships into a unified risk representation.

Figure 3. The engineering-constraint-guided causal risk representation debiasing module separates stable risk features from environmental bias through causal gating, adversarial environment learning, and orthogonality constraints.

Figure 4. The counterfactual anomaly response enhancement module constructs intervention representations for key risk factors and compares risk response changes to improve the identification and explanation of true anomaly drivers.

Figure 5. ROC curve comparison of different models for cross-border fraud detection, where the proposed method achieves the highest AUC and demonstrates superior risk identification performance.

Figure 6. Performance comparison across different evaluation scenarios shows that the proposed method maintains stable detection performance under random split, cross-time, cross-region, cross-port, cross-entity, and multiple perturbation settings.

Figure 7. The F1-score drop comparison after module ablation shows that the structured declaration modality, cross-modal attention, and document semantic modality contribute most significantly to model performance.

Table 1. Summary of multisource cross-border data and hardware sensors.

Data Name	Data Volume	Sensor Type
Electronic documents and commodity texts	128,460	Scanner/OCR device
Structured declaration fields	86,320	Port terminal/Declaration terminal
GPS land trajectories (Total)	42,680	GPS sensor
Proprietary operational GPS data	34,144	GPS sensor
Public GPS data	8536	GPS sensor
AIS maritime trajectories (Total)	31,900	AIS device
Proprietary operational AIS data	19,140	AIS device
Public AIS data	12,760	AIS device
Port weighing data	39,760	Weighbridge/Dynamic weighing sensor
X-ray inspection images	18,240	X-ray inspection device
RFID identification data	28,450	RFID reader/RFID tag
Electronic seal status data	24,630	Electronic seal/Door sensor
Cold-chain temperature and humidity data	35,720	Temperature/Humidity sensor
Transportation vibration data	16,850	Acceleration/Vibration sensor
Port video and image data	21,360	HD camera/LPR camera
Anomaly risk labels (Total)	86,320	Inspection/Review terminal
Positive anomaly samples	7520	Inspection/Review terminal
Negative normal samples	78,800	Inspection/Review terminal
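
The label counts in Table 1 imply a strongly imbalanced task. The short sketch below derives the positive rate and the accuracy a trivial "always normal" predictor would obtain. It assumes that evaluation splits roughly preserve the overall class ratio, which the table itself does not state; the variable names are illustrative.

```python
# Label counts taken from Table 1.
positives = 7520
negatives = 78800
total = positives + negatives  # equals the "Anomaly risk labels (Total)" row

positive_rate = positives / total
majority_accuracy = negatives / total  # accuracy of always predicting "normal"

print(f"total labels:       {total}")
print(f"positive rate:      {positive_rate:.4f}")
print(f"majority accuracy:  {majority_accuracy:.4f}")
```

With fewer than one in ten samples labeled anomalous, Accuracy alone is a weak discriminator between models, which is consistent with the paper also reporting Precision, Recall, F1-score, and PR-AUC.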

Table 2. Performance comparison with baseline models on cross-border logistics anomaly monitoring.

Model	Accuracy	Precision	Recall	F1-Score	AUC	PR-AUC
Logistic Regression	$0.821 \pm 0.006$	$0.684 \pm 0.007$	$0.612 \pm 0.008$	$0.646 \pm 0.007$	$0.842 \pm 0.005$	$0.631 \pm 0.008$
Random Forest	$0.846 \pm 0.005$	$0.713 \pm 0.006$	$0.657 \pm 0.007$	$0.684 \pm 0.006$	$0.871 \pm 0.005$	$0.669 \pm 0.006$
XGBoost	$0.862 \pm 0.004$	$0.736 \pm 0.005$	$0.681 \pm 0.006$	$0.707 \pm 0.005$	$0.889 \pm 0.004$	$0.694 \pm 0.005$
MLP	$0.851 \pm 0.005$	$0.721 \pm 0.006$	$0.663 \pm 0.007$	$0.691 \pm 0.006$	$0.876 \pm 0.005$	$0.681 \pm 0.006$
TabTransformer	$0.873 \pm 0.004$	$0.754 \pm 0.005$	$0.704 \pm 0.006$	$0.728 \pm 0.005$	$0.903 \pm 0.004$	$0.718 \pm 0.005$
TextCNN	$0.835 \pm 0.006$	$0.701 \pm 0.007$	$0.639 \pm 0.008$	$0.669 \pm 0.007$	$0.861 \pm 0.005$	$0.654 \pm 0.007$
BiLSTM	$0.842 \pm 0.005$	$0.708 \pm 0.006$	$0.648 \pm 0.007$	$0.677 \pm 0.006$	$0.868 \pm 0.005$	$0.662 \pm 0.006$
BERT	$0.867 \pm 0.004$	$0.748 \pm 0.005$	$0.697 \pm 0.006$	$0.722 \pm 0.005$	$0.898 \pm 0.004$	$0.711 \pm 0.005$
BERT+MLP	$0.889 \pm 0.004$	$0.781 \pm 0.005$	$0.736 \pm 0.006$	$0.758 \pm 0.005$	$0.921 \pm 0.003$	$0.746 \pm 0.005$
Multimodal Transformer	$0.903 \pm 0.004$	$0.806 \pm 0.004$	$0.762 \pm 0.005$	$0.783 \pm 0.004$	$0.936 \pm 0.003$	$0.771 \pm 0.004$
MSDG	$0.908 \pm 0.004$	$0.812 \pm 0.004$	$0.775 \pm 0.005$	$0.793 \pm 0.004$	$0.941 \pm 0.003$	$0.782 \pm 0.004$
MFGAN	$0.912 \pm 0.003$	$0.818 \pm 0.004$	$0.784 \pm 0.005$	$0.801 \pm 0.004$	$0.945 \pm 0.003$	$0.791 \pm 0.004$
CSRDN	$0.915 \pm 0.003$	$0.825 \pm 0.004$	$0.789 \pm 0.004$	$0.806 \pm 0.004$	$0.948 \pm 0.003$	$0.796 \pm 0.004$
MDGAR	$0.918 \pm 0.003$	$0.831 \pm 0.004$	$0.796 \pm 0.004$	$0.813 \pm 0.004$	$0.951 \pm 0.003$	$0.804 \pm 0.004$
Proposed method	$0.927 \pm 0.003$	$0.842 \pm 0.004$	$0.811 \pm 0.005$	$0.826 \pm 0.004$	$0.958 \pm 0.002$	$0.817 \pm 0.004$
p-value (vs. best baseline)	$0.014$	$0.021$	$0.008$	$0.011$	$0.003$	$0.015$
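
As a quick consistency check on Table 2, the F1-score can be recomputed from the reported Precision and Recall using F1 = 2PR / (P + R). The sketch below does this for three rows. Because the table reports means over repeated runs, the mean F1 need not equal the F1 of the mean Precision and Recall exactly, so small deviations would not indicate an error.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) copied from Table 2.
rows = {
    "Logistic Regression": (0.684, 0.612, 0.646),
    "MDGAR": (0.831, 0.796, 0.813),
    "Proposed method": (0.842, 0.811, 0.826),
}

for name, (p, r, reported) in rows.items():
    computed = f1_from_pr(p, r)
    print(f"{name}: computed F1 = {computed:.3f}, reported F1 = {reported:.3f}")
```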

Table 3. Generalization and robustness evaluation under different cross-border monitoring scenarios.

Evaluation Scenario	Accuracy	F1-Score	AUC	Top-K Hit Rate	Early Warning Gain
Random test split	$0.927 \pm 0.003$	$0.826 \pm 0.004$	$0.958 \pm 0.002$	$0.873 \pm 0.004$	$0.861 \pm 0.005$
Cross-time testing	$0.913 \pm 0.005$	$0.804 \pm 0.006$	$0.944 \pm 0.004$	$0.852 \pm 0.005$	$0.837 \pm 0.006$
Cross-region testing	$0.906 \pm 0.006$	$0.793 \pm 0.007$	$0.936 \pm 0.005$	$0.841 \pm 0.006$	$0.826 \pm 0.007$
Cross-port testing	$0.901 \pm 0.006$	$0.786 \pm 0.007$	$0.931 \pm 0.005$	$0.834 \pm 0.006$	$0.819 \pm 0.007$
Cross-entity testing	$0.895 \pm 0.007$	$0.777 \pm 0.008$	$0.924 \pm 0.006$	$0.826 \pm 0.007$	$0.811 \pm 0.008$
Text noise perturbation	$0.909 \pm 0.005$	$0.798 \pm 0.006$	$0.939 \pm 0.004$	$0.846 \pm 0.005$	$0.831 \pm 0.006$
Missing modality simulation	$0.897 \pm 0.006$	$0.781 \pm 0.007$	$0.928 \pm 0.005$	$0.829 \pm 0.006$	$0.814 \pm 0.007$
Logistics trajectory perturbation	$0.905 \pm 0.006$	$0.791 \pm 0.007$	$0.934 \pm 0.005$	$0.839 \pm 0.006$	$0.823 \pm 0.007$
Sensor record missing	$0.899 \pm 0.006$	$0.784 \pm 0.007$	$0.930 \pm 0.005$	$0.831 \pm 0.006$	$0.817 \pm 0.007$
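
To make the degradation in Table 3 easier to compare across scenarios, the sketch below expresses each F1-score as a relative drop from the random test split. The values are the reported means only; the standard deviations are ignored, so this is a rough summary rather than a significance analysis.

```python
baseline_f1 = 0.826  # random test split, Table 3

scenario_f1 = {
    "Cross-time testing": 0.804,
    "Cross-region testing": 0.793,
    "Cross-port testing": 0.786,
    "Cross-entity testing": 0.777,
    "Text noise perturbation": 0.798,
    "Missing modality simulation": 0.781,
    "Logistics trajectory perturbation": 0.791,
    "Sensor record missing": 0.784,
}

# Relative F1 drop with respect to the random split.
relative_drop = {
    name: (baseline_f1 - f1) / baseline_f1 for name, f1 in scenario_f1.items()
}

worst = max(relative_drop, key=relative_drop.get)
mildest = min(relative_drop, key=relative_drop.get)

for name, drop in sorted(relative_drop.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {100 * drop:.1f}% relative F1 drop")
```

On these means, every scenario stays within roughly 6% of the random-split F1-score, with cross-entity testing showing the largest drop and cross-time testing the smallest.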

Table 4. Ablation study of the proposed multisource sensing fusion and causal explanation framework.

Model Variant	Accuracy	Precision	Recall	F1-Score	AUC	PR-AUC
Full model	$0.927 \pm 0.003$	$0.842 \pm 0.004$	$0.811 \pm 0.005$	$0.826 \pm 0.004$	$0.958 \pm 0.002$	$0.817 \pm 0.004$
w/o document semantic modality	$0.902 \pm 0.005$	$0.803 \pm 0.006$	$0.764 \pm 0.007$	$0.783 \pm 0.006$	$0.934 \pm 0.004$	$0.772 \pm 0.005$
w/o structured declaration modality	$0.894 \pm 0.006$	$0.791 \pm 0.007$	$0.752 \pm 0.008$	$0.771 \pm 0.007$	$0.926 \pm 0.005$	$0.759 \pm 0.007$
w/o entity relationship modality	$0.908 \pm 0.005$	$0.811 \pm 0.006$	$0.772 \pm 0.007$	$0.791 \pm 0.006$	$0.941 \pm 0.004$	$0.781 \pm 0.006$
w/o cross-modal attention	$0.899 \pm 0.005$	$0.798 \pm 0.006$	$0.758 \pm 0.007$	$0.777 \pm 0.006$	$0.931 \pm 0.005$	$0.766 \pm 0.006$
w/o contrastive alignment loss	$0.910 \pm 0.004$	$0.816 \pm 0.005$	$0.779 \pm 0.006$	$0.797 \pm 0.005$	$0.943 \pm 0.004$	$0.786 \pm 0.005$
w/o causal debiasing module	$0.907 \pm 0.005$	$0.809 \pm 0.006$	$0.773 \pm 0.007$	$0.791 \pm 0.006$	$0.939 \pm 0.004$	$0.779 \pm 0.006$
w/o counterfactual response module	$0.914 \pm 0.004$	$0.821 \pm 0.005$	$0.786 \pm 0.006$	$0.803 \pm 0.005$	$0.946 \pm 0.003$	$0.792 \pm 0.005$
w/o engineering constraints	$0.904 \pm 0.005$	$0.806 \pm 0.006$	$0.767 \pm 0.007$	$0.786 \pm 0.006$	$0.936 \pm 0.004$	$0.775 \pm 0.006$
Only document + structured fields	$0.889 \pm 0.006$	$0.781 \pm 0.007$	$0.736 \pm 0.008$	$0.758 \pm 0.007$	$0.921 \pm 0.005$	$0.746 \pm 0.007$
Only structured fields	$0.873 \pm 0.007$	$0.754 \pm 0.008$	$0.704 \pm 0.009$	$0.728 \pm 0.008$	$0.903 \pm 0.006$	$0.718 \pm 0.008$
p-value (vs. best variant)	$0.012$	$0.018$	$0.009$	$0.015$	$0.004$	$0.017$
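
The single-component ablations in Table 4 can be ranked by how much F1-score is lost when each component is removed. The sketch below computes these drops from the reported means. It is only a reading aid for the table, and the dictionary keys are shortened labels for the "w/o" rows.

```python
full_f1 = 0.826  # full model, Table 4

# F1-score of each "w/o ..." variant in Table 4.
ablated_f1 = {
    "document semantic modality": 0.783,
    "structured declaration modality": 0.771,
    "entity relationship modality": 0.791,
    "cross-modal attention": 0.777,
    "contrastive alignment loss": 0.797,
    "causal debiasing module": 0.791,
    "counterfactual response module": 0.803,
    "engineering constraints": 0.786,
}

drops = {name: round(full_f1 - f1, 3) for name, f1 in ablated_f1.items()}
ranking = sorted(drops, key=drops.get, reverse=True)

for name in ranking:
    print(f"w/o {name}: F1 drop = {drops[name]:.3f}")
```

The three largest drops come from removing the structured declaration modality, cross-modal attention, and the document semantic modality, which matches the ordering summarized in the Figure 7 caption.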

Table 5. Hyperparameter sensitivity analysis of the proposed framework.

Hyperparameter	Evaluated Value	F1-Score	AUC
Batch size	16	$0.812 \pm 0.005$	$0.945 \pm 0.004$
Batch size	32	$0.826 \pm 0.004$	$0.958 \pm 0.002$
Batch size	64	$0.819 \pm 0.006$	$0.952 \pm 0.005$
Learning rate $α$	$1 \times 10^{- 5}$	$0.815 \pm 0.006$	$0.949 \pm 0.004$
Learning rate $α$	$2 \times 10^{- 5}$	$0.826 \pm 0.004$	$0.958 \pm 0.002$
Learning rate $α$	$5 \times 10^{- 5}$	$0.808 \pm 0.007$	$0.941 \pm 0.006$
Causal loss weight $λ_{1}$	$0.3$	$0.817 \pm 0.005$	$0.950 \pm 0.004$
Causal loss weight $λ_{1}$	$0.5$	$0.826 \pm 0.004$	$0.958 \pm 0.002$
Causal loss weight $λ_{1}$	$0.7$	$0.821 \pm 0.005$	$0.955 \pm 0.003$
Counterfactual loss weight $λ_{2}$	$0.1$	$0.814 \pm 0.006$	$0.946 \pm 0.005$
Counterfactual loss weight $λ_{2}$	$0.3$	$0.826 \pm 0.004$	$0.958 \pm 0.002$
Counterfactual loss weight $λ_{2}$	$0.5$	$0.819 \pm 0.005$	$0.951 \pm 0.004$
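
A simple way to read Table 5 is to take, for each hyperparameter, the value with the highest mean F1-score and the spread between the best and worst setting. The sketch below does this from the reported means. The spread is used here as an informal sensitivity indicator, which is an assumption of this example and not a measure defined in the paper.

```python
# Mean F1-score per evaluated value, copied from Table 5.
f1_by_setting = {
    "batch size": {16: 0.812, 32: 0.826, 64: 0.819},
    "learning rate": {1e-5: 0.815, 2e-5: 0.826, 5e-5: 0.808},
    "causal loss weight": {0.3: 0.817, 0.5: 0.826, 0.7: 0.821},
    "counterfactual loss weight": {0.1: 0.814, 0.3: 0.826, 0.5: 0.819},
}

# Best value and best-minus-worst F1 spread for each hyperparameter.
best_value = {name: max(vals, key=vals.get) for name, vals in f1_by_setting.items()}
spread = {
    name: max(vals.values()) - min(vals.values())
    for name, vals in f1_by_setting.items()
}
most_sensitive = max(spread, key=spread.get)

for name in f1_by_setting:
    print(f"{name}: best = {best_value[name]}, F1 spread = {spread[name]:.3f}")
```

Over the evaluated ranges, the best settings coincide with the configuration that yields the headline F1-score of 0.826, and the learning rate shows the widest F1 spread.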

