1. Introduction
Short-form video platforms such as TikTok have become central arenas for sharing content related to migration journeys, border crossings, and humanitarian crises [1,2]. Videos in this context are typically multimodal, combining visual scenes, on-screen text, audio, and metadata in ways that complicate automated analysis. Identifying which videos depict or reference irregular migration is particularly challenging when critical cues appear in overlaid text or visual patterns while captions are short, multilingual, or noisy. Existing computational research on social media often focuses on text-only content or static images, leaving the joint modeling of video signals, OCR text, and contextual metadata comparatively underexplored for migration-related narratives [3,4]. Rather than pursuing end-to-end deep architectures, this study deliberately adopts a feature-based, interpretable modeling strategy. This choice reflects the constraints of sensitive domains such as irregular migration, where data availability is limited, annotation is costly, and transparency of automated decisions is essential. The goal is therefore not to maximize raw predictive performance or to compete with end-to-end multimodal architectures, but to answer a complementary question: which interpretable multimodal cues remain reliable signals of migration-related narratives under extreme data scarcity. Accordingly, the study does not aim at a semantic understanding of migration narratives; it aims to detect migration-indicative multimodal cues that can support monitoring and weakly supervised data collection.
1.1. Background
Prior work on multimodal classification of social media posts has shown that combining textual and visual features can substantially improve the detection of events, misinformation, and crisis-related content compared to single-modality approaches [4,5]. Studies of TikTok in particular highlight the platform’s complexity, with fast-paced editing, on-screen captions, and diverse creator communities that make conventional text-based pipelines insufficient. For sensitive domains such as migration, this multimodal complexity intersects with ethical and practical constraints on data access, annotation, and sharing, which in turn limits the availability of public, well-documented datasets [6,7].
1.2. Motivation
In the context of irregular migration, TikTok videos may depict boats, sea crossings, group movement, or coastal regions that implicitly signal migration routes, while captions are often in Arabic or other non-English languages [7]. As a result, moderation and research workflows that rely only on captions or hashtags risk missing relevant content or misclassifying ambiguous posts. At the same time, labeled datasets in this space remain scarce and typically small-scale, making it difficult to benchmark multimodal approaches and to understand which signals are most informative for migration-related classification. There is therefore a need for pilot studies that systematically compare modalities and extract interpretable insights that can inform larger-scale, weakly supervised labeling in future work [8,9,10].
Building on our prior research on irregular migration discourse on Twitter [11] and its evolution across X (formerly Twitter) and Telegram [12], this study addresses the significantly higher complexity of the TikTok video landscape. While our previous work demonstrated the efficacy of text-based Transformer models in low-resource settings, short-form video presents a dual challenge: relevant cues are dispersed across modalities (visual, audio, OCR), and the content itself is highly ephemeral due to platform moderation.
1.3. Contributions
This paper offers a small-scale, multimodal study of migration-related content detection on TikTok using interpretable feature fusion. First, it introduces a manually annotated dataset of 50 videos (35 migration-related, 15 non-related) enriched with features derived from platform metadata, automated video analysis, and OCR text extracted from video frames and images. Second, it benchmarks text-only, OCR-only, and vision-only baselines against a fusion configuration using Logistic Regression, Random Forest, and XGBoost classifiers, showing that multimodal fusion consistently outperforms single-modality models in stratified cross-validation. Third, it provides feature-importance and SHAP analyses that highlight the central role of OCR-derived keywords, maritime labels, and regional indicators for migration-related classification, and discusses how these findings can guide future weakly supervised labeling at larger scale.
In summary, the main contributions of this work are:
A lightweight, interpretable multimodal fusion framework: We propose an interpretable classification pipeline that combines text, OCR, and visual signals to detect migration-related content on TikTok, outperforming single-modality baselines.
Identification of robust predictors: Through SHAP and feature importance analysis, we demonstrate that OCR-derived keywords (e.g., route terms) and regional indicators are the most reliable signals for detection, offering higher discriminative power than generic visual labels.
Robustness verification via synthetic augmentation: We validate the stability of the proposed features by scaling the training set to 500 samples using the Synthetic Minority Over-sampling Technique (SMOTE), providing evidence that the high classification performance is robust and not merely an artifact of the small pilot dataset.
1.4. Related Work
Research on migration and social media has shown how platforms such as TikTok, Instagram, and Facebook enable migrants and their families to share experiences, negotiate belonging, and exchange information about routes and risks, while also being used by smugglers to advertise crossings and connect with potential clients [13,14]. Studies of hashtags and communities (e.g., work on #Migrantes) highlight how platform-specific affordances and recommendation systems shape the visibility of precarious migration narratives, including emerging forms of “digital smuggling” and the circulation of route information via short videos and comments. Recent studies on TikTok ‘refugees’ communities [15] highlight detection challenges that our features address [16]. In parallel, a growing body of work examines TikTok’s multimodal complexity, proposing annotation schemes and analytical frameworks that take into account the joint role of video, audio, and on-screen text in meaning-making. Recent Transformer-based systems such as MTikGuard detect harmful content on TikTok with high performance (reported F1 ≈ 0.89), but are designed for large-scale moderation and do not focus on interpretable migration-specific cues [17]. Similarly, attention-guided frameworks have demonstrated the value of explainable feature attribution in social media content moderation. For instance, Shukla et al. [18] propose an encoder-only Transformer that identifies which input features (text, metadata, user behavior) drive detection decisions for fake profiles, demonstrating that interpretability builds trust in automated moderation systems. This aligns with our use of SHAP-based feature attribution to ensure transparency in migration content monitoring.
Beyond migration, multimodal classification approaches have reported consistent performance gains by fusing textual and visual representations. For example, Sánchez-Villegas et al. [19] demonstrate improved multimodal classification performance using Ber-ViT fusion on large-scale social media datasets. Such studies frequently rely on deep architectures and large datasets, and they rarely focus on small, highly sensitive domains where interpretability and careful feature design are essential. In contrast, our work focuses on interpretable feature-level fusion under extreme data scarcity. Beyond academic work, industry and technical reports emphasize that multimodal moderation can uncover harmful or policy-violating content that text-only systems miss by jointly analyzing captions, audio, and visual scenes [20,21,22]. This broader shift towards multimodal moderation frameworks reinforces the need for interpretable feature design in domains such as irregular migration, where automated decisions have potentially serious consequences.
Finally, research on weak supervision and distant labeling provides methods for scaling annotation by combining heuristic labeling functions, model predictions, and noisy signals from multiple sources, which is particularly relevant for domains where manual labeling is costly and risky. The present study connects these strands by focusing on a small, interpretable multimodal feature space for migration-related TikTok content and by discussing how such features could support future weakly supervised labeling at larger scale.
2. Dataset
Our study relies on a manually curated dataset of 50 TikTok videos, comprising videos related to irregular migration journeys and clearly non-migration control content. The dataset is designed as a small-scale pilot resource that combines multimodal signals in a compact, interpretable feature table suitable for benchmarking fusion approaches and analyzing which cues are most informative.
Data collection in this domain is constrained by the adversarial and ephemeral nature of the content. As noted in our studies on text-based platforms [11,12], actors involved in irregular migration frequently adapt their vocabulary and hashtags to evade detection. On TikTok, this is compounded by aggressive moderation policies that lead to rapid takedowns of migration-related videos. Consequently, constructing a large-scale, persistent dataset is inherently difficult, necessitating a pilot approach where a smaller, manually verified corpus serves as a stable ground truth to validate interpretable features before attempting larger-scale weak supervision.
Each video is represented through three primary information sources: platform metadata, automated video analysis outputs, and OCR text extracted from frames or static images.
2.1. Raw Data Sources
The first source consists of TikTok platform metadata, including video identifier, URL, uploader handle (platform-specific user identifier), background sound, caption text, publishing date, approximate location (when available), and engagement statistics such as like, comment, share, and save counts. These fields provide basic contextual information, as well as short textual snippets that may explicitly reference migration routes or destinations.
The second source is the output of the Google Cloud Video Intelligence API (Google LLC, Mountain View, CA, USA) [23], which was applied to each video to obtain segment-level labels with label type, textual description, temporal boundaries, and confidence scores. Across the 50 videos, this process produced 4489 label annotations, covering concepts such as boats, sea, vehicles, people, and miscellaneous scene descriptors. These labels act as a structured proxy for visual content and allow the construction of binary indicators for high-level concepts (e.g., maritime scenes or group movement) without training a custom vision model.
The third source covers text extracted from the visual channel using OCR. For all videos, we applied OCR to selected frames to recover overlaid text, while for the ten videos that consist of a static image with audio, we additionally used the Google Vision API to obtain image-level OCR and locale information. This step captures route descriptions, place names, and other textual cues that appear on screen but are not present in the caption field, which is particularly important when captions are short, noisy, or written in non-English languages.
2.2. Video-Level Feature Construction
All three raw sources were aggregated into a unified video-level representation. For each video, we computed global statistics such as the total number of detected entities (n_entities), the number of entities that were manually mapped as migration-related (n_positive), and their proportion (fraction_positive). In addition, we derived binary indicators for specific concepts and geographic regions, informed by qualitative inspection of the content and the underlying labels.
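The global statistics described above can be illustrated with a minimal sketch. The `Entity` record and its `is_migration_related` field are hypothetical names introduced here for illustration; the manual mapping itself is assumed to be available as a per-entity boolean.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """One detected entity for a video (hypothetical record layout)."""
    description: str
    is_migration_related: bool  # result of the manual mapping (assumed field)


def aggregate_video(entities):
    """Compute n_entities, n_positive, and fraction_positive for one video."""
    n_entities = len(entities)
    n_positive = sum(1 for e in entities if e.is_migration_related)
    # Guard against videos for which no entity was detected.
    fraction_positive = n_positive / n_entities if n_entities else 0.0
    return {
        "n_entities": n_entities,
        "n_positive": n_positive,
        "fraction_positive": fraction_positive,
    }


# Illustrative input only; not taken from the dataset.
example = [
    Entity("boat", True),
    Entity("sea", True),
    Entity("song", False),
    Entity("text", False),
]
features = aggregate_video(example)
```

Returning 0.0 for videos without detected entities is a design choice of this sketch, not a documented property of the original pipeline.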
Visual labels from the Video Intelligence API were grouped into interpretable flags, such as hasLbl_boat, hasLbl_sea, hasLbl_ocean, hasLbl_vehicle, hasLbl_text, and hasLbl_wave. OCR-derived cues were encoded as binary indicators for migration-related terms, including hasOCR_route, hasOCR_boat, and hasOCR_jet_boat, based on keyword lists that were normalized to English.
Multilingual OCR Processing: The Google Cloud Vision API provides production-grade multilingual OCR with integrated language detection, transliteration, and translation capabilities [24]. For each video frame or static image, the API:
Detects script and language (e.g., Arabic script → language: Modern Standard Arabic or dialectal variants; Greek script → language: Greek; Latin script → language: English, French, Turkish, etc.)
Extracts raw text in the original script with confidence scores
Transliterates non-Latin scripts to Latin characters where applicable (e.g., Arabic “طريق” → transliteration: “tariq”)
Translates to English via the API’s integrated translation layer (e.g., “طريق” → English: “route”; “λέμβος” → English: “boat”)
We apply case-insensitive keyword matching to the English-normalized text against predefined migration-relevant term lists:
- Route terms: {route, road, path, way, crossing, border, passage}
- Vessel terms: {boat, ship, dinghy, raft, yacht, jet ski, rubber boat}
- Geographic terms: {Turkey, Greece, Italy, Balkans, Mediterranean, Aegean, Libya, etc.}
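The keyword-matching step over the term lists above can be sketched as follows. The group names (`route`, `vessel`, `geographic`) and the use of whole-word boundaries are assumptions of this sketch; the text only specifies case-insensitive matching on English-normalized text, and the geographic list is open-ended in the source, so only the explicitly named terms are included.

```python
import re

# Term lists as given in the text. The geographic list ends with "etc." in the
# source, so only the explicitly named terms appear here.
TERM_LISTS = {
    "route": ["route", "road", "path", "way", "crossing", "border", "passage"],
    "vessel": ["boat", "ship", "dinghy", "raft", "yacht", "jet ski", "rubber boat"],
    "geographic": ["Turkey", "Greece", "Italy", "Balkans", "Mediterranean",
                   "Aegean", "Libya"],
}


def match_terms(english_text):
    """Return one binary flag per term group for English-normalized OCR text."""
    flags = {}
    for group, terms in TERM_LISTS.items():
        # Whole-word matching (assumption) avoids hits such as "way" in "always".
        pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
        found = re.search(pattern, english_text, flags=re.IGNORECASE)
        flags[group] = int(found is not None)
    return flags


flags_hit = match_terms("Safe ROUTE to Greece by rubber boat")
flags_miss = match_terms("always sunny")
```

How each group maps onto the published flags (for example, vessel terms onto hasOCR_boat and hasOCR_jet_boat) is not specified in the text and would need to be defined separately.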
Table 1 provides concrete examples of the multilingual OCR extraction and normalization pipeline across Arabic, Greek, and mixed-script content. While errors may occur in low-resolution frames, mixed-script overlays (e.g., Arabic text with embedded Latin URLs), or heavily stylized fonts, manual verification of the N = 50 pilot dataset confirmed >90% OCR accuracy for migration-relevant keywords. For the remaining ~10% with OCR errors or missing text, the model relies on visual labels (hasLbl_boat, hasLbl_sea) and regional metadata (has_RegionEU, has_RegionTurkey) to make predictions, demonstrating the robustness of the multimodal fusion approach.
Higher-level semantic flags were introduced to capture group motion and geographic context, including has_group_movement, has_RegionEU, has_RegionTurkey, and has_RegionBalkans. The final feature table therefore combines text-like, visual, and regional signals into a compact multimodal vector for each video, accompanied by a binary label, is_relevant, indicating whether the content was annotated as migration-related (35 videos) or non-related (15 videos).
To improve interpretability and reproducibility, entity names and feature indicators were normalized and documented using a consistent naming scheme so that future work can re-implement or extend the feature extraction process without access to the original TikTok content.
2.3. Annotation Protocol and Ethics
Videos were identified through keyword-based search and exploratory browsing on TikTok, focusing on content that appears to depict sea crossings, border regions, or narratives about migration journeys, as well as clearly unrelated control content.
All 50 videos were manually labeled by a single researcher (first author) following a two-stage annotation protocol designed to minimize label-feature circularity:
Stage 1 (Automated Feature Extraction): Platform metadata, OCR text, and visual labels were extracted via Google Cloud APIs without human interpretation of migration relevance.
Stage 2 (Semantic Annotation): The annotator reviewed each video’s visual narrative, caption context, and overall communicative intent to determine migration relevance. Videos were labeled as migration-related (y = 1) if they depicted sea crossings, border journeys, or migration narratives based on holistic interpretation of the audiovisual content, not solely on the presence of individual extracted features. For instance, a tourism video containing the word “boat” and maritime scenes would be labeled y = 0 based on contextual disambiguation (presence of leisure indicators, absence of distress or group movement cues), even though isolated features overlap with migration content. Conversely, videos with implicit migration signals (e.g., crowded dinghies, nighttime departure scenes) were labeled y = 1 even when OCR or captions lacked explicit migration keywords.
This two-stage design ensures that the extracted features serve as evidence corroborating human judgment, rather than deterministic labeling criteria, reducing the risk that high classification performance merely reflects internal consistency of engineered flags rather than genuine detection of migration-related narratives.
Ambiguous cases where multimodal signals were contradictory or unclear were excluded during the initial screening phase, resulting in a high-confidence binary dataset.
While formal inter-annotator agreement could not be computed due to resource constraints in this pilot study, we conducted a post hoc reliability check: a second independent annotator (blind to the extracted features) reviewed a random subset of 10 videos (20% of the dataset) using the same two-stage protocol. This validation yielded 90% agreement (9 of 10 videos; the single disagreement concerned a borderline travel vlog with maritime elements but no migration context), supporting the reliability of the semantic judgment layer despite single-annotator constraints in the full dataset. Future work should employ multiple annotators and compute formal Cohen’s Kappa on a larger sample (N ≥ 100) to establish definitive inter-rater reliability benchmarks.
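For reference, the Cohen’s Kappa statistic mentioned above can be computed from two annotators’ labels as in the following sketch. The label vectors are hypothetical (ten videos with a single disagreement) and are not the study’s annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical category
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels: one disagreement on the last video.
annotator_1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
annotator_2 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

On such a small sample, kappa is noticeably lower than raw agreement because chance agreement is high when one class dominates, which is one reason a larger sample is preferable.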
Given the sensitivity of irregular migration and the platform’s terms of service, raw videos and user-identifying information are not redistributed. Instead, the work focuses on derived features and aggregate statistics designed to support methodological transparency while reducing privacy and security risks for content creators and migrants. URLs and identifiers are kept only for internal reproducibility and are not intended for public release, and any future sharing of feature schemas or code will follow institutional ethical approvals and platform policies.
Potential biases were mitigated by utilizing the Google Cloud Vision API, which employs production-grade multilingual models explicitly supported for the target languages [24]. Although errors in mixed-script overlays or low-resolution frames are possible, manual verification of the pilot dataset ensured the validity of the extracted keywords. Future work will further address noise through language-aware normalization and multilingual models such as mT5 [25]. No user or migrant personally identifiable information is retained; only anonymized, aggregated feature vectors are used. This approach ensures privacy protection alongside full methodological reproducibility.
Given the vulnerability of populations involved in irregular migration, this study was conducted under strict ethical protocols designed to minimize potential harm. Although the research relies exclusively on publicly available social media content, we treated the dataset as ethically sensitive material. The methodology operates solely on derived and anonymized feature representations rather than raw user data. No personally identifiable information, user handles, or raw video files are redistributed or retained. The study does not involve interaction with human subjects. Therefore, according to institutional and international research guidelines for publicly accessible data, formal IRB approval was not required.
To address the limitation of small sample sizes and to validate the scalability of our feature set, we constructed a larger “silver standard” dataset. We scraped metadata and thumbnails for N = 600 unlabeled videos using domain-specific hashtags (e.g., #migrantes, #harraga, #tahrib, #boatcrossing). This silver-standard dataset is used exclusively to validate scalability and feature transferability and is not used for performance benchmarking.
We then applied a heuristic labeling function based on the robust predictors identified in our pilot study. Specifically, a video was automatically assigned a positive label (y = 1) if migration-related keywords (e.g., “route”, “boat”, “border”) were detected via OCR in the thumbnail or present in the caption. While this inclusive ‘OR’ logic prioritizes high recall and may introduce limited label noise, it effectively captures the diversity of migration narratives within the pre-filtered hashtag communities and serves as a robust initialization for weak supervision. This automated process yielded a weakly labeled dataset of 600 videos (481 migration-related, 119 non-related), serving as a proof-of-concept for scalability, indicating that our interpretable features can filter content at scale without manual annotation.
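The inclusive ‘OR’ labeling function described above can be sketched as follows. The keyword tuple contains only the three examples named in the text; the full keyword list, the function names, and the use of substring matching on English-normalized text are assumptions of this sketch.

```python
# Example keywords named in the text; the complete list is not given there.
MIGRATION_KEYWORDS = ("route", "boat", "border")


def contains_keyword(text):
    """Case-insensitive substring check against the keyword examples."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in MIGRATION_KEYWORDS)


def weak_label(thumbnail_ocr_text, caption):
    """Assign y = 1 if a keyword occurs in the thumbnail OCR text OR the caption."""
    return int(contains_keyword(thumbnail_ocr_text) or contains_keyword(caption))


# Illustrative inputs only.
label_caption_hit = weak_label("", "crossing the border tonight")
label_ocr_hit = weak_label("BOAT", "")
label_no_hit = weak_label("", "cooking pasta at home")
```

Because a single keyword in either channel is sufficient, the function favors recall over precision, which is consistent with the label noise discussed for the silver-standard corpus.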
The silver-standard labels have not been validated in full against manual ground truth and should be considered noisy weak labels. To estimate their quality, we performed a systematic verification on a stratified random sample of N = 50 videos from the generated silver corpus (8.3% of the total). Manual review confirmed that 44 of the 50 videos were correctly classified as migration-related, yielding a precision of 88%. This check suggests that the proposed interpretable features (OCR keywords and region flags) act as a high-fidelity filter, allowing for scalable dataset creation with manageable label noise (approximately 12% in this sample) without the need for massive manual annotation. Future work will validate a larger stratified sample of these labels and apply label denoising techniques such as confident learning or Snorkel to improve label quality before using them for large-scale model training.
3. Methodology
The goal of the modeling pipeline is to predict whether a TikTok video is migration-related based on its multimodal features while maintaining interpretability and robustness under a small-sample regime. Our approach contributes to the growing literature on multimodal classification under constraints [19], where interpretable feature engineering and auxiliary tasks enable scalable monitoring without requiring massive labeled datasets. To this end, the experiments compare single-modality baselines (Text, OCR, Vision) with a Fusion setting that concatenates all feature groups (Figure 1), using three widely adopted classifiers with complementary bias–variance characteristics: Logistic Regression (LR), Random Forest (RF), and XGBoost (XGB). All models are trained and evaluated under stratified cross-validation to mitigate variance, and performance is reported in terms of accuracy, precision, recall, and F1-score, with a focus on the migration-related (positive) class.
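For clarity, the four reported metrics can be written out for the positive (migration-related) class as in the sketch below. The function name and the example labels are illustrative; in practice, equivalent functions from scikit-learn would typically be used.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Define undefined ratios as 0.0 (a convention chosen for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels only: 3 true positives, 1 miss, 1 false alarm, 1 true negative.
metrics = binary_metrics([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
```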
3.1. Modalities and Feature Groups
The Text configuration uses caption text and selected metadata-derived indicators that encode linguistic or keyword information, ignoring purely visual or regional cues. The OCR configuration restricts the input to OCR-derived features, including binary indicators for migration-related terms such as route and boat extracted from frames and static images. The Vision configuration uses only visual label indicators from the Video Intelligence API (e.g., hasLbl_boat, hasLbl_sea, hasLbl_ocean, hasLbl_vehicle, hasLbl_wave), treating the presence of specific labels as high-level proxies for scene content. Finally, the Fusion configuration concatenates all feature groups—text-like features, OCR indicators, visual labels, and region/group-movement flags—into a single multimodal feature vector per video.
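The modality configurations can be viewed as column selections over the same feature table, as in the following sketch. The OCR, Vision, and region/group-movement names are taken from the text; the two Text feature names are hypothetical placeholders, since the caption-derived indicators are not enumerated there.

```python
FEATURE_GROUPS = {
    # Hypothetical placeholder names for caption/metadata-derived indicators.
    "Text": ["caption_has_keyword", "caption_length"],
    "OCR": ["hasOCR_route", "hasOCR_boat", "hasOCR_jet_boat"],
    "Vision": ["hasLbl_boat", "hasLbl_sea", "hasLbl_ocean",
               "hasLbl_vehicle", "hasLbl_wave"],
    # Used only inside Fusion; not evaluated as a standalone configuration.
    "Region": ["has_group_movement", "has_RegionEU",
               "has_RegionTurkey", "has_RegionBalkans"],
}


def feature_names(config):
    """Return the ordered feature names for a modality configuration."""
    if config == "Fusion":
        # Fusion concatenates every group in a fixed order.
        return [name for group in FEATURE_GROUPS.values() for name in group]
    return list(FEATURE_GROUPS[config])


def to_vector(video, config):
    """Project one video's feature dict onto the chosen configuration."""
    return [video.get(name, 0) for name in feature_names(config)]


ocr_vector = to_vector({"hasOCR_route": 1, "hasLbl_sea": 1}, "OCR")
fusion_width = len(feature_names("Fusion"))
```

Keeping the groups in one mapping makes it straightforward to run every classifier on every configuration with identical preprocessing.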
3.2. Classification Models and Evaluation
All data processing and model implementation were performed using the Python programming language (version 3.8; Python Software Foundation, Wilmington, DE, USA) and the scikit-learn library (version 1.0; NumFOCUS, Austin, TX, USA). For each modality configuration, three classifier families are trained: LR with L2 regularization, RF with an ensemble of decision trees, and gradient-boosted trees via XGBoost (Distributed by DMLC, Seattle, WA, USA) [26]. These models are chosen because they are relatively data-efficient, provide access to feature importance measures, and support post hoc interpretability through SHAP analysis in the case of tree-based models. SHAP-based analyses have been widely adopted in applied machine learning as a practical tool for explaining individual and aggregate model predictions, offering local and global attributions that help domain experts understand which features drive decisions [27,28]. Following work that extends SHAP to multimodal social media content [29], our analysis shows that OCR-derived and regional features dominate the attributions.
Given the small dataset size (50 labeled videos), all experiments use nested stratified cross-validation to ensure unbiased performance estimation and prevent information leakage from test folds into hyperparameter selection. The nested CV structure consists of:
- Outer loop (5-fold stratified CV): The dataset is partitioned into 5 folds, ensuring class balance (maintaining the ~70/30 positive/negative ratio in each fold). Each fold serves as a held-out test set exactly once, while the remaining 4 folds form the training set for that iteration.
- Inner loop (3-fold stratified CV within each training set): For each of the 5 outer iterations, the 4 training folds undergo an additional 3-fold cross-validation for hyperparameter tuning via grid search. Hyperparameter combinations (e.g., RF: n_estimators ∈ {50, 100, 200}, max_depth ∈ {3, 5, 10}; XGB: learning_rate ∈ {0.01, 0.1, 0.3}, max_depth ∈ {3, 5, 7}) are evaluated on the inner validation folds, and the configuration maximizing F1-score on the positive class is selected.
- Final evaluation: The model is retrained on the full 4-fold training set using the optimal hyperparameters identified in the inner loop, then evaluated on the outer held-out test fold. This process repeats for all 5 outer folds.
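The index bookkeeping behind the nested design above can be sketched in plain Python. This is only an illustration of how outer test folds stay isolated from inner tuning folds; the helper `stratified_folds` is a hypothetical name, and in a scikit-learn workflow the same structure is typically obtained with `StratifiedKFold` and `GridSearchCV`.

```python
import random
from collections import defaultdict


def stratified_folds(indices, labels, k, seed=0):
    """Split indices into k folds while preserving the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal each class round-robin so every fold gets its share.
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


# 35 migration-related and 15 non-related videos, as in the pilot dataset.
labels = [1] * 35 + [0] * 15
outer_folds = stratified_folds(list(range(50)), labels, k=5)

splits = []
for f, test_idx in enumerate(outer_folds):
    train_idx = [i for g, fold in enumerate(outer_folds) if g != f for i in fold]
    # Inner folds are built from training indices only, so no test-fold
    # information can reach hyperparameter selection.
    inner_folds = stratified_folds(train_idx, labels, k=3, seed=f + 1)
    splits.append((train_idx, test_idx, inner_folds))
```

Each outer test fold contains 10 videos (7 positive, 3 negative), and each inner split partitions the remaining 40 training videos.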
This nested design (Figure 2) guarantees that no test-fold information influences hyperparameter selection, as the test data is completely isolated from the tuning process. The reported metrics (Table 2) reflect outer-fold performance averaged across the 5 iterations, providing an unbiased estimate of generalization performance within the N = 50 sample. We report mean accuracy, precision, recall, and F1-score across outer folds with standard deviations and 95% confidence intervals (via bootstrap resampling, 1000 bootstrap iterations), and for the best-performing Fusion models, we additionally compute feature importances and SHAP values to identify which multimodal cues drive the predictions.
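A percentile bootstrap interval of the kind mentioned above can be sketched as follows. The text does not state which unit is resampled; this sketch assumes the outer-fold scores are resampled with replacement, and the five F1 values are placeholders rather than reported results.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Placeholder per-fold F1 scores for illustration only.
fold_f1 = [0.90, 0.95, 0.88, 0.92, 0.95]
ci_low, ci_high = bootstrap_ci(fold_f1)
```

With only five fold-level scores, such an interval is necessarily coarse, which is consistent with the cautious interpretation of the metrics in this study.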
3.3. Robustness Analysis via Synthetic Augmentation
To perform a sanity check on feature separability beyond the pilot sample, we implemented a synthetic augmentation protocol. This approach does not test generalization to new, unseen migration patterns or adversarial evasion tactics, but rather assesses whether the discriminative power of the identified features (hasOCR_route, has_RegionEU, etc.) remains consistent when the feature space is synthetically expanded. We partitioned the dataset into a stratified held-out test set of 10 real videos (20% of the total) to serve as a ‘gold standard’ for evaluation, and a training set containing the remaining 40 videos. To simulate a larger-scale data environment and diagnose feature stability beyond small-sample variance, we applied the Synthetic Minority Over-sampling Technique (SMOTE) (imbalanced-learn library; NumFOCUS, Austin, TX, USA) [30] strictly within the training partition (ensuring that synthetic samples were generated only from training data and no information leaked into the held-out test set). This process generated a balanced, synthetically augmented training set of 500 samples (250 per class) based on the feature distributions of the real videos. The classifiers were then re-trained on this augmented dataset and evaluated exclusively on the held-out real videos to assess their generalization capability and feature stability. We emphasize that SMOTE is employed here strictly as a diagnostic tool to evaluate feature stability, not to inflate performance claims or substitute for real data.
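The interpolation idea behind SMOTE can be illustrated with a simplified, self-contained sketch. This is not the imbalanced-learn implementation used in the study; the function `smote_like`, the neighbour count, and the randomly generated binary training vectors are all assumptions made for illustration, and both classes are oversampled here to reach 250 samples each.

```python
import random


def smote_like(samples, n_target, k=3, seed=0):
    """Grow `samples` to n_target rows by interpolating between neighbours."""
    rng = random.Random(seed)
    augmented = [list(s) for s in samples]  # real rows are kept unchanged
    while len(augmented) < n_target:
        base = rng.choice(samples)
        # k nearest real neighbours of the base point (squared Euclidean).
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        # A synthetic point lies on the segment between base and neighbour.
        augmented.append([a + gap * (b - a) for a, b in zip(base, other)])
    return augmented


# Hypothetical training partition: 28 positive and 12 negative binary vectors.
train_rng = random.Random(1)
positives = [[train_rng.randint(0, 1) for _ in range(4)] for _ in range(28)]
negatives = [[train_rng.randint(0, 1) for _ in range(4)] for _ in range(12)]

aug_pos = smote_like(positives, 250)
aug_neg = smote_like(negatives, 250)
```

Because every synthetic row is a convex combination of two real training rows, the augmented set stays inside the range of the original features, and binary flags may take fractional values.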
Our research question is methodological: If the observed high performance in stratified cross-validation were due to overfitting or spurious correlations specific to the N = 50 sample, would feature importance rankings remain consistent when the feature space is synthetically expanded to N = 500?
The key finding is not the 100% held-out accuracy (which, given N = 10, carries wide confidence intervals and limited generalizability), but rather the consistency of feature rankings across settings. Specifically, hasOCR_route, has_RegionEU, and hasOCR_jet_boat emerge as the top predictors in both the original cross-validation experiments (Table 2, Figure 3 and Figure 4) and the SMOTE-augmented setting (Figure 4). This consistency suggests that these multimodal cues capture genuine discriminative patterns rather than sample-specific artifacts and supports their use as reliable anchors in weakly supervised labeling functions for larger-scale datasets.
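One simple way to quantify such ranking consistency is the overlap between the top-k features of two settings, sketched below. The importance values are placeholders chosen only to make the example runnable; they are not the values reported in the figures, and the helper names are hypothetical.

```python
def top_k(importances, k=3):
    """Names of the k features with the largest importance values."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def top_k_overlap(importances_a, importances_b, k=3):
    """Fraction of top-k features shared by two importance rankings."""
    shared = set(top_k(importances_a, k)) & set(top_k(importances_b, k))
    return len(shared) / k


# Placeholder importances for illustration only (not reported results).
cv_importance = {
    "hasOCR_route": 0.31, "has_RegionEU": 0.24, "hasOCR_jet_boat": 0.15,
    "hasLbl_boat": 0.10, "hasLbl_sea": 0.08,
}
smote_importance = {
    "hasOCR_route": 0.28, "has_RegionEU": 0.26, "hasOCR_jet_boat": 0.17,
    "hasLbl_sea": 0.11, "hasLbl_boat": 0.07,
}
overlap = top_k_overlap(cv_importance, smote_importance, k=3)
```

A rank correlation over the full feature list would be a stricter alternative to top-k overlap when the ordering of lower-ranked features also matters.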
Importantly, SMOTE generates synthetic interpolations within the existing feature space and does not introduce new migration patterns, linguistic variations, or adversarial content that would test true out-of-distribution generalization. The augmentation is strictly a diagnostic tool to assess feature stability, not a substitute for collecting real, diverse samples from evolving migration discourse on TikTok.
4. Experimental Results
This section summarizes the performance of the multimodal classifiers across modality configurations and examines which features drive predictions in the best-performing models. We first compare Text, OCR, Vision, and Fusion settings using standard classification metrics, and then analyze feature importances and SHAP values to identify the most informative multimodal cues for migration-related detection.
4.1. Overall Performance Across Modalities
Before evaluating machine learning classifiers, we establish a rule-based baseline that directly implements the multimodal cues identified during annotation. This baseline predicts a video as migration-related (y = 1) if any of the following conditions hold:
OCR contains migration-relevant terms: hasOCR_route = 1 OR hasOCR_boat = 1 OR hasOCR_jet_boat = 1
Regional and maritime combination: (has_RegionEU = 1 OR has_RegionTurkey = 1) AND (hasLbl_boat = 1 OR hasLbl_sea = 1 OR hasLbl_ocean = 1)
Group movement with maritime cues: has_group_movement = 1 AND (hasLbl_boat = 1 OR hasLbl_wave = 1)
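The three conditions above translate directly into Boolean logic over the feature flags. In the sketch below, the function name and the dictionary representation of a video are illustrative choices; missing flags are treated as 0.

```python
def rule_baseline(video):
    """Predict y = 1 if any of the three rule conditions holds."""

    def on(name):
        # A flag counts as active only when it is explicitly set to 1.
        return video.get(name, 0) == 1

    # Condition 1: OCR contains migration-relevant terms.
    ocr_rule = on("hasOCR_route") or on("hasOCR_boat") or on("hasOCR_jet_boat")
    # Condition 2: regional flag combined with a maritime label.
    region_maritime = (on("has_RegionEU") or on("has_RegionTurkey")) and (
        on("hasLbl_boat") or on("hasLbl_sea") or on("hasLbl_ocean")
    )
    # Condition 3: group movement combined with maritime cues.
    group_maritime = on("has_group_movement") and (
        on("hasLbl_boat") or on("hasLbl_wave")
    )
    return int(ocr_rule or region_maritime or group_maritime)


# Illustrative feature dictionaries only.
pred_ocr = rule_baseline({"hasOCR_route": 1})
pred_region_only = rule_baseline({"has_RegionEU": 1})
pred_region_sea = rule_baseline({"has_RegionEU": 1, "hasLbl_sea": 1})
pred_group_wave = rule_baseline({"has_group_movement": 1, "hasLbl_wave": 1})
```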
This heuristic achieves Accuracy = 0.76, Precision = 0.94, Recall = 0.71, F1 = 0.81 on the N = 50 dataset under stratified 5-fold cross-validation, confirming that the identified multimodal cues are discriminative. However, the rule-based approach suffers from two limitations:
- False positives: Tourism or recreational boating content that mentions “boat” and coastal regions (e.g., “boat tour in Turkey”) is incorrectly flagged as migration-related.
- False negatives: Videos with implicit migration signals (e.g., crowded dinghies, nighttime scenes) but lacking explicit OCR keywords or regional tags are missed.
The machine learning classifiers below address these limitations by learning weighted feature combinations and contextual disambiguation patterns that simple Boolean logic cannot capture.
Single-modality models achieve F1-scores between approximately 0.71 and 0.86, indicating that each modality carries useful but incomplete information about migration-related content. The rule-based baseline (F1 = 0.81) outperforms some single-modality configurations (e.g., Text XGB, Vision XGB) but remains inferior to Text RF (F1 = 0.86), confirming that even simple ML models can learn discriminative patterns within individual modalities.
However, the Fusion configuration consistently yields the best performance, with LR and RF reaching mean accuracy around 0.90 and F1-scores up to 0.92 on the migration-related class, while XGB performs slightly lower but still above both single-modality baselines and the rule-based heuristic. These results indicate that combining visual labels, OCR cues, and regional indicators yields consistent gains (ΔF1 = +0.11 over rule baseline, +0.06 over best single-modality) through learned feature interactions rather than simple aggregation.
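As a consistency check, the reported F1-scores can be recomputed from precision and recall with the harmonic-mean formula. The sketch below uses the rule-baseline values quoted above (Precision = 0.94, Recall = 0.71) and the Fusion LR values reported in Section 4.5 (precision 1.00, recall 0.86); the function name is illustrative.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


rule_f1 = round(f1_from_pr(0.94, 0.71), 2)       # rule-based baseline
fusion_lr_f1 = round(f1_from_pr(1.00, 0.86), 2)  # Fusion LR (Section 4.5 values)
delta_f1 = round(fusion_lr_f1 - rule_f1, 2)

print(rule_f1, fusion_lr_f1, delta_f1)  # 0.81 0.92 0.11
```

The rounded values reproduce the F1 = 0.81 and F1 = 0.92 figures and the ΔF1 = +0.11 gap over the rule baseline.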
The performance gap between the rule-based baseline (F1 = 0.81) and Fusion ML models (F1 = 0.92) illustrates the value of learned feature interactions. While the rule flags any video that contains an OCR keyword such as “boat”, or that combines a regional or group-movement flag with a maritime label, the Fusion models learn contextual disambiguation:
- A video with OCR text “boat tour Turkey” triggers the rule (hasOCR_boat = 1, has_RegionTurkey = 1), but the Fusion LR model correctly classifies it as non-migration (y = 0) by down-weighting the combination when hasLbl_song = 1 and has_group_movement = 0, which correlates with touristic content in the training data.
- Conversely, a video with the Arabic caption “رحلة” (journey) and no explicit migration keywords produces no OCR match, so the OCR condition of the rule does not fire; the Fusion RF model nonetheless detects it through the joint presence of has_RegionEU = 1, hasLbl_boat = 1, hasLbl_wave = 1, and n_entities > 20 (indicating complex visual scenes typical of migration content).
This demonstrates that the ML classifiers identify multimodal corroboration patterns—combinations of features that jointly signal migration narratives—rather than simply aggregating individual cues. The rule-based approach cannot capture such interactions without exponentially expanding the rule set, which would require manual engineering and risk overfitting to the pilot sample.
At the same time, although the overall numbers are promising, they should be interpreted with caution because they are obtained on a small sample of 50 videos. In this setting, even stratified cross-validation is susceptible to variance across folds, and small changes in the train–test split can lead to noticeable fluctuations in the reported metrics. We therefore treat the results as indicative of the relative value of different modalities, rather than as definitive estimates of real-world performance. Metrics are reported for completeness; no claims of out-of-sample generalization are made due to the limited dataset size.
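The fold-level variance can be made concrete with simple arithmetic. Assuming the class counts reported in Section 5 (35 migration-related, 15 non-related) and an even stratified split, each of the five test folds contains 10 videos, so a single misclassification moves fold accuracy by 0.10. The helper below is an illustrative sketch, not the study's splitting code.

```python
from typing import List, Tuple


def stratified_fold_sizes(n_pos: int, n_neg: int, k: int) -> List[Tuple[int, int]]:
    """Per-fold (positives, negatives) counts when each class is split as evenly as possible."""

    def split(n: int) -> List[int]:
        base, extra = divmod(n, k)
        return [base + (1 if i < extra else 0) for i in range(k)]

    return list(zip(split(n_pos), split(n_neg)))


folds = stratified_fold_sizes(35, 15, 5)
# Change in fold accuracy caused by one additional misclassified video.
accuracy_step = [1 / (pos + neg) for pos, neg in folds]
# Change in fold recall caused by one additional missed positive.
recall_step = [1 / pos for pos, _ in folds]

print(folds[0], accuracy_step[0], round(recall_step[0], 3))  # (7, 3) 0.1 0.143
```

With only seven positives per test fold, one missed video changes fold-level recall by about 0.14, which is why the metrics are treated as indicative rather than definitive.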
4.2. Feature Importance and SHAP Analysis
To better understand which multimodal cues drive the Fusion models, we analyze feature importances from the RF classifier and SHAP values from the XGB model trained on the Fusion configuration.
Figure 3 visualizes the distribution of SHAP values for the main Fusion features, illustrating how high values of OCR-based route terms and European or Turkish region indicators systematically push predictions towards the migration-related class.
Figure 4 summarizes the mean absolute SHAP values, confirming that hasOCR_route, has_RegionEU, has_RegionTurkey, and hasLbl_song have the largest overall impact on the model output.
In the RF Fusion model, OCR, region, and label indicators such as hasOCR_route, has_RegionEU, has_RegionTurkey, and hasLbl_song emerge as the most influential features. The prominence of hasLbl_song likely reflects a platform-specific pattern in which migration-related content is frequently presented as static images or low-motion scenes accompanied by background music rather than as dynamic footage; this underscores the importance of on-screen text and geographic context for migration-related classification. Additional features such as has_group_movement, hasOCR_boat, hasLbl_boat, hasLbl_sea, and hasLbl_ocean also contribute positively, reflecting the relevance of group motion and maritime scenes to the migration class.
SHAP value distributions for the XGB Fusion model further illustrate that these features systematically push predictions towards the migration-related label across multiple videos, rather than being driven by a few isolated outliers. For example, videos with strong OCR signals about routes and boats and with European or Turkish coastal regions flagged tend to receive higher positive SHAP contributions, whereas videos lacking these cues are more likely to be classified as non-related even when they contain generic maritime elements. Taken together, these analyses suggest that interpretable OCR and region features are central to the success of the Fusion setting and provide concrete guidance for designing weakly supervised labeling rules in future, larger-scale studies.
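The global ranking summarized in Figure 4 is the mean absolute SHAP value per feature, that is, the column-wise average of the absolute per-video contributions. The sketch below shows this aggregation on a small matrix; the numeric values are toy numbers chosen for illustration and are not taken from the study.

```python
from typing import List, Sequence, Tuple


def mean_abs_shap(
    shap_rows: Sequence[Sequence[float]], feature_names: Sequence[str]
) -> List[Tuple[str, float]]:
    """Rank features by the mean absolute SHAP value across samples."""
    n = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    scores = [(name, total / n) for name, total in zip(feature_names, totals)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy SHAP matrix (rows = videos, columns = features); values are hypothetical.
names = ["hasOCR_route", "has_RegionEU", "hasLbl_sea"]
toy_shap = [
    [0.4, 0.2, 0.05],
    [-0.3, 0.1, -0.05],
    [0.5, -0.3, 0.0],
]
ranking = mean_abs_shap(toy_shap, names)
```

Taking absolute values before averaging means that a feature pushing strongly in either direction ranks highly; the sign information is what the distribution plot in Figure 3 adds.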
4.3. Feature Stability Assessment via Synthetic Augmentation
The models trained on the synthetically augmented dataset (N = 500) demonstrated consistent feature importance rankings and strong class separation on the held-out real test set (N = 10). Logistic Regression achieved perfect separation on this specific held-out slice (Accuracy = 1.0), while Random Forest and Gradient Boosting achieved F1-scores of 0.92 and 0.93, respectively. However, with N = 10 the 95% Wilson score confidence interval for the perfect accuracy is wide (0.70–1.00). This result therefore indicates that the identified features (OCR keywords, region flags) are highly discriminative for these particular examples; it does not guarantee zero-error performance on future, unseen migration content. SMOTE augmentation preserves the statistical structure of the original N = 50 sample and does not simulate new linguistic variations, route changes, or platform-specific evasion tactics that would challenge the model in real-world deployment.
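The width of this interval follows directly from the Wilson score formula. The sketch below is a generic implementation with z = 1.96 for the 95% level; the function name is illustrative. For 10 correct predictions out of 10 it gives a lower bound of roughly 0.72, of the same order as the interval quoted above (small differences can arise from rounding or from the interval variant used).

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(10, 10)
print(round(low, 2), round(high, 2))  # 0.72 1.0
```

Even with no observed errors, the data remain compatible with a true accuracy well below 0.80, which is why the perfect score is not read as evidence of zero-error performance.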
As shown in Figure 5, the feature importance analysis on the augmented dataset remained consistent with the original findings, identifying hasOCR_route, hasOCR_jet_boat, and has_RegionEU as the top predictors. This consistency suggests that these multimodal cues are robust signals for detecting migration-related content, even when the feature space is synthetically expanded.
Interestingly, generic metadata features such as n_positive and fraction_positive also appear among the top predictors. This likely reflects the higher visual complexity of migration-related scenes (e.g., crowded boats, dynamic maritime environments), which tend to trigger a larger volume of API detections compared to the simpler visual composition of many non-related videos.
In summary, the SMOTE-based analysis serves as a diagnostic sanity check confirming that the top-ranked features (hasOCR_route, has_RegionEU) are not artifacts of the specific N = 50 sample composition, but it does not validate generalization to new migration narratives or adversarial content. Future work should evaluate the framework on independently collected, temporally separated datasets to assess true out-of-distribution robustness.
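The reason SMOTE cannot introduce genuinely new patterns is visible in its core step: each synthetic sample is a linear interpolation between a real sample and one of its same-class nearest neighbors. The sketch below shows only that interpolation step; neighbor search is omitted, the function name is hypothetical, and it is not the implementation used in the study.

```python
import random
from typing import List, Sequence


def smote_point(
    x: Sequence[float], neighbor: Sequence[float], rng: random.Random
) -> List[float]:
    """Create one synthetic sample on the line segment between x and a same-class neighbor."""
    lam = rng.random()  # interpolation weight drawn from [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(0)
real = [1.0, 0.0, 1.0]       # e.g., hasOCR_route, hasLbl_song, has_RegionEU (illustrative)
neighbor = [1.0, 1.0, 0.0]   # another real sample from the same class
synthetic = smote_point(real, neighbor, rng)
```

Every coordinate of the synthetic sample lies between the two real values, so features on which both videos agree are copied unchanged and no feature value outside the observed range can appear. For binary indicators, interpolation also yields fractional values unless a categorical variant of SMOTE is used.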
4.4. Platform-Specific Artifacts and Limitations
While the multimodal features demonstrate strong discriminative power (F1 = 0.92), some high-ranking predictors may reflect dataset composition rather than intrinsic migration signals. This section examines potential artifacts and discusses implications for deployment scalability and temporal stability.
hasLbl_song as a Content Format Indicator: The label hasLbl_song emerged as a top-5 predictor in both the Random Forest feature importance analysis and SHAP value rankings (Figure 3 and Figure 4), which initially appears counterintuitive for migration content detection. Manual inspection of the dataset suggests that this feature reflects a platform-specific content pattern: many migration-related videos in our sample are static images or low-motion slideshows with overlaid text and background music, rather than dynamic footage. This format does not appear to be accidental: creators seem to use image-based posts with burned-in text to evade TikTok’s automated video content moderation, which is reportedly more aggressive toward live-action footage depicting border crossings or boat journeys.
While hasLbl_song is discriminative in this pilot dataset, it is best read as a proxy for content format and evasion tactics rather than as a semantic migration cue. This pattern may not generalize to other migration content formats (e.g., first-person vlogs, news clips, advocacy campaigns) or to other platforms where different moderation regimes incentivize different posting strategies. Future iterations of the framework should treat format-based features as context-dependent filters that may require periodic re-validation as platform policies and creator behaviors evolve.
Temporal and Geographic Stability: Several high-impact features are tied to specific migration routes and time periods. For instance, has_RegionTurkey and has_RegionBalkans reflect the Eastern Mediterranean and Western Balkan routes that were prominent during the data collection period (2023–2024). As migration flows shift due to border enforcement changes, policy interventions, or geopolitical events, the geographic distribution of route-related content on TikTok will change accordingly. A model trained exclusively on Turkey–Greece maritime crossings may miss content depicting Sahel overland routes, U.S.–Mexico border narratives, or Channel crossings between France and the UK.
To maintain accuracy over time, the framework requires adaptive feature updating: regional keyword lists and OCR dictionaries should be expanded based on monitoring of emerging hashtags, consultation with domain experts, and periodic analysis of unlabeled data streams. The interpretability of the feature set facilitates this adaptation—adding a new region indicator (e.g., has_RegionLibya) requires only updating a keyword list, not retraining complex neural encoders.
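The following sketch illustrates why such an update is cheap: region indicators are derived from a keyword lexicon, so a new flag is a new dictionary entry. The flag names follow the text, but the keyword lists and the substring-matching strategy are hypothetical placeholders, not the lexicon used in the study.

```python
from typing import Dict, List

# Hypothetical keyword lexicon; real lists would be curated with domain experts.
REGION_KEYWORDS: Dict[str, List[str]] = {
    "has_RegionEU": ["europe", "greece", "italy"],
    "has_RegionTurkey": ["turkey", "türkiye", "izmir"],
}


def region_flags(text: str, lexicon: Dict[str, List[str]]) -> Dict[str, int]:
    """Set each region flag to 1 if any of its keywords occurs in the text."""
    lowered = text.lower()
    return {
        flag: int(any(keyword in lowered for keyword in keywords))
        for flag, keywords in lexicon.items()
    }


# Adding a new region indicator only requires a new lexicon entry.
extended = dict(REGION_KEYWORDS, has_RegionLibya=["libya", "tripoli"])
flags = region_flags("Boat from Tripoli to Italy", extended)
```

The classifier still has to be refit once a new column is added, but for LR, RF, or XGB on a few hundred samples this is a matter of seconds. Plain substring matching is used here for brevity; a production lexicon would need tokenization and multilingual variants.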
API Dependency and Vendor Lock-In: The current pipeline relies on Google Cloud Video Intelligence API for visual label extraction and Google Cloud Vision API for OCR. This introduces two risks:
Cost scalability: At production scale (e.g., monitoring 10,000 videos/day), API costs may become prohibitive, particularly for full video analysis with frame-level labeling.
Taxonomy changes: If Google modifies its label ontology (e.g., merging “boat” and “watercraft” labels, deprecating “wave”), the feature extraction pipeline would break, requiring re-mapping and potentially re-annotation of training data.
However, the SHAP-based feature importance analysis provides a roadmap for cost-efficient deployment that mitigates these risks:
High-impact, low-cost features:
- Regional indicators (has_RegionEU, has_RegionTurkey) can be extracted from caption geotags, video descriptions, and user profile locations at near-zero cost using simple keyword matching.
- OCR-based route terms (hasOCR_route, hasOCR_boat) require only OCR API calls on keyframes (1–3 frames per video) rather than full video analysis, reducing costs by an estimated ~85% compared to full Video Intelligence processing.
Redundancy-based optimization:
- Since OCR features alone achieve F1 = 0.83 (Table 2), a cost-conscious production system could use OCR and regional flags as a first-pass filter, invoking expensive full-video analysis only for OCR-negative or ambiguous cases (e.g., videos with no text overlay but strong maritime visual cues).
- This tiered approach would process an estimated ~70% of content using low-cost features, reserving API-intensive analysis for the remaining ~30%.
Vendor independence pathway:
- Future work should evaluate open-source alternatives to commercial APIs. For example:
  - Replace Google Video Intelligence with CLIP-based keyframe embeddings (zero-shot classification into “boat,” “sea,” “crowd”) and simple text classifiers for captions.
  - Replace Google Vision OCR with Tesseract or EasyOCR for Latin/Arabic script extraction, combined with translation APIs or multilingual models (e.g., LaBSE sentence embeddings or mT5) for keyword normalization.
A pilot experiment comparing API-based vs. open-source feature extraction on a subset of the data would quantify the performance–cost trade-off and inform deployment decisions. If open-source approaches achieve comparable F1 (e.g., within 0.05 of API-based performance), the framework could be deployed without vendor lock-in.
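The tiered strategy described above can be sketched as a two-stage feature extractor. The two extractor callables stand in for the cheap step (keyframe OCR plus keyword matching) and the expensive step (full-video label analysis); they are hypothetical placeholders, not real API clients, and the escalation rule (escalate when no OCR term is found) is one possible reading of the first-pass filter.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, int]
OCR_KEYS = ("hasOCR_route", "hasOCR_boat", "hasOCR_jet_boat")


def tiered_features(
    video_id: str,
    cheap_extractor: Callable[[str], Features],
    full_extractor: Callable[[str], Features],
) -> Tuple[Features, str]:
    """Run the cheap extractor first; call the expensive one only for OCR-negative videos."""
    features = dict(cheap_extractor(video_id))
    if any(features.get(key, 0) == 1 for key in OCR_KEYS):
        return features, "cheap"  # OCR-positive: low-cost features suffice
    # OCR-negative or ambiguous: escalate to full-video analysis.
    features.update(full_extractor(video_id))
    return features, "full"
```

Logging the returned tier label for each video would give a direct measurement of how much content is actually handled by the low-cost path, which could then be compared against the estimated share above.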
The identified artifacts (hasLbl_song format bias, geographic specificity) and API dependencies do not invalidate the proposed framework, but they do highlight the need for continuous monitoring, adaptive feature updates, and cost-aware deployment strategies. The interpretability of the feature set, ensured by the use of classical ML models and SHAP analysis, enables these adaptations without requiring expensive re-labeling or architectural overhauls. In contrast, end-to-end deep multimodal models (e.g., video-language Transformers) would typically require retraining or fine-tuning to accommodate route shifts or platform changes, making them less suitable for the dynamic, adversarial environment of migration content monitoring.
4.5. Error Analysis and Model Limitations
To understand the failure modes of the proposed framework and identify opportunities for improvement, we manually inspected misclassifications from the best-performing Fusion model (Logistic Regression, F1 = 0.92) and analyzed the 12% label noise observed in the weakly supervised silver-standard dataset (Section 2.3). This analysis reveals systematic patterns in false positives and false negatives that highlight both the strengths and limitations of OCR- and region-based filtering.
False Positives: Tourism and News Content:
In the pilot dataset (N = 50), the Fusion LR model achieved perfect precision (1.00) under cross-validation, indicating zero false positives in this controlled sample. However, the silver-standard dataset—constructed via automated keyword-based labeling—exhibits approximately 12% label noise, primarily from false positives. Manual verification of a stratified random sample (N = 50) from the silver corpus identified two dominant error patterns:
Tourism and recreational content: Videos depicting boat tours, beach vacations, or coastal tourism that contain migration-related keywords in captions or OCR (e.g., “boat trip to Greek islands,” “Turkey beach resort”) but lack migration context. These cases trigger the OCR route filter (hasOCR_boat = 1, has_RegionEU = 1) but are contextually unrelated to irregular migration. The ML model correctly down-weights these in the pilot dataset by exploiting correlated signals (e.g., presence of hasLbl_beach, absence of has_group_movement), but the keyword-only weak labeling function cannot make this distinction.
News and policy discussion: Journalistic content or commentary videos discussing migration policy, border enforcement, or humanitarian crises without depicting actual journeys. For example, a news anchor standing in a studio with a caption “New migration route through Balkans” would be flagged as positive by the weak labeling heuristic, despite not being user-generated migration content. These videos often contain regional keywords and migration terminology but lack the visual cues (crowded boats, nighttime scenes, first-person footage) characteristic of journey documentation.
Implications: The 12% false positive rate in the silver dataset confirms that keyword-based weak labeling is a high-recall, moderate-precision strategy suitable for initial data collection but insufficient for production deployment without refinement. Future iterations should incorporate:
- Sentiment and discourse analysis: Distinguish between advocacy/news discourse and experiential narratives using linguistic features (e.g., first-person pronouns, present tense verbs, affective language).
- User profile signals: Creator account type (news outlet vs. individual user), follower count, and posting patterns can help filter institutional content from personal migration documentation.
- Visual scene classifiers: Add dedicated detectors for touristic scenes (beach umbrellas, hotel pools, tourist groups with cameras) to reduce maritime false positives.
False Negatives: Euphemisms and Land-Based Routes:
While the pilot dataset exhibits few false negatives in cross-validation (recall = 0.86 for Fusion LR), qualitative analysis of unlabeled TikTok content suggests three scenarios where the current feature set may under-capture migration-related videos:
Coded language and euphemisms: Smugglers and migrants increasingly use metaphorical language or coded terms to evade content moderation. For instance, references to “the game” (Balkan route slang), “journey,” “crossing,” or emoji sequences (🚢🌊) may not be captured by the current OCR keyword lists, which focus on explicit terms like “route,” “boat,” and “border.” Expanding the lexicon to include slang, regional dialects, and evolving jargon requires continuous monitoring and consultation with domain experts or native speakers.
Land-based and overland routes: The current feature set is optimized for maritime migration (Mediterranean, Aegean Sea crossings) due to the prevalence of boat-related labels (hasLbl_sea, hasLbl_boat) and coastal regions (has_RegionTurkey, has_RegionEU) in the training data. Videos depicting overland routes—Sahel desert crossings, U.S.–Mexico border treks, or Central Asian transit corridors—may lack these maritime cues and be missed by the model. Addressing this gap requires adding land-based visual features (hasLbl_desert, hasLbl_truck, hasLbl_fence) and expanding regional indicators beyond Mediterranean geographies.
Minimal text or visual content: Some migration-related videos consist of brief clips with no OCR overlay and generic captions (e.g., “🙏❤️”), relying entirely on visual narrative. If the video quality is low or the scene is ambiguous (e.g., nighttime footage with minimal visible detail), the Video Intelligence API may produce generic labels (hasLbl_darkness, hasLbl_night) that do not trigger migration-specific features. These cases would benefit from audio analysis (background sounds of engines, waves, voices) or deeper visual understanding via frame-level embeddings.
Implications: The current framework achieves strong recall (0.86) on the pilot sample but may exhibit lower recall on out-of-distribution content (non-maritime routes, evolving linguistic codes). The interpretable feature design facilitates incremental expansion—adding new keywords, regions, or visual labels—but systematic evaluation on diverse migration corpora (land routes, different time periods, multiple languages) is necessary to quantify and address these gaps.
Lessons for Weakly Supervised Scaling: The error patterns observed in both the pilot dataset and the silver corpus provide actionable guidance for improving weak supervision at scale:
- Precision–recall trade-off: The current weak labeling function (keyword-based OR logic) prioritizes recall, accepting ~12% noise to capture diverse migration narratives. For applications requiring higher precision (e.g., content moderation with legal implications), a stricter AND logic combining multiple modalities (OCR route terms AND regional flags AND maritime visual labels) would reduce false positives at the cost of lower recall.
- Iterative refinement: Use the trained ML model to re-score the silver-standard dataset, identifying high-confidence errors (e.g., P(y = 1) < 0.3 for weakly labeled positives) for manual review and label correction, following confident learning or Snorkel-style denoising protocols.
- Active learning loop: Deploy the model to rank unlabeled videos by prediction uncertainty (0.4 < P(y = 1) < 0.6), prioritizing borderline cases for human annotation to maximize information gain and improve decision boundaries in ambiguous regions of the feature space.
In summary, the observed error patterns are not fundamental limitations of the multimodal approach, but rather opportunities for feature expansion, lexicon updates, and label refinement that the interpretable framework is explicitly designed to accommodate.
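The precision–recall trade-off and the re-scoring step can be sketched as follows. The grouping of features into OCR, region, and maritime sets and the exact form of the OR and AND labeling functions are assumptions made for illustration; the 0.3 threshold is the example value given in the text.

```python
from typing import Iterable, List, Mapping, Sequence, Tuple

OCR_TERMS = ("hasOCR_route", "hasOCR_boat")
REGIONS = ("has_RegionEU", "has_RegionTurkey")
MARITIME = ("hasLbl_boat", "hasLbl_sea")


def _any_on(features: Mapping[str, int], names: Sequence[str]) -> bool:
    return any(features.get(name, 0) == 1 for name in names)


def weak_label_or(features: Mapping[str, int]) -> int:
    """High-recall labeling: any single modality cue is enough."""
    return int(
        _any_on(features, OCR_TERMS) or _any_on(features, REGIONS) or _any_on(features, MARITIME)
    )


def weak_label_and(features: Mapping[str, int]) -> int:
    """High-precision labeling: all three modalities must agree."""
    return int(
        _any_on(features, OCR_TERMS) and _any_on(features, REGIONS) and _any_on(features, MARITIME)
    )


def review_candidates(
    rows: Iterable[Tuple[str, int, float]], threshold: float = 0.3
) -> List[str]:
    """Return ids of weakly labeled positives that the trained model scores below the threshold."""
    return [vid for vid, weak, prob in rows if weak == 1 and prob < threshold]
```

A tourism video with OCR text about a boat and a European region flag, but no maritime visual label, is labeled positive by the OR function and negative by the AND function, which is the behavior the trade-off above describes.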
5. Discussion and Limitations
The results of this small-scale study suggest that combining video labels, OCR text, and regional indicators substantially improves migration-related content detection on TikTok compared to single-modality models. The strong contribution of OCR- and region-based features aligns with qualitative observations that many migration videos rely on on-screen text and geographic hints rather than explicit captions, especially when captions are short, multilingual, or otherwise difficult to process with standard tools. From a methodological perspective, these findings reinforce the value of interpretable feature design in sensitive domains where fully end-to-end deep models may be difficult to audit and reproduce. From an applied perspective, OCR- and region-focused signals appear particularly valuable for early-stage monitoring systems when captions are sparse, multilingual, or intentionally ambiguous, functioning as high-recall filters for further inspection.
At the same time, several limitations constrain how these results should be interpreted. The primary limitation is the small dataset size (N = 50 videos: 35 migration-related, 15 non-related), which constrains statistical power. The restriction to 50 videos was necessitated by the high complexity of manual multimodal annotation, which required verifying audio, scrutinizing frame-by-frame OCR, and checking metadata for each ephemeral video, a process significantly more labor-intensive than text-only labeling. This limitation is partially mitigated through stratified 5-fold cross-validation with bootstrap 95% confidence intervals (Table 2), SMOTE augmentation to N = 500 followed by evaluation on a held-out real test set of N = 10 (Table 3), and the construction of a silver-standard dataset (N = 600 videos labeled via OCR heuristics). The retrieval of 481 migration-related videos suggests that combining OCR signals with metadata is a viable strategy for large-scale monitoring, while the stable performance on held-out real data supports the discriminative capacity of the selected features within this pilot setting. Taken together, these findings support the suitability of the proposed feature set as candidate anchors for weakly supervised labeling and downstream monitoring at scale.
A second limitation concerns the dependence on platform- and API-specific features. The work relies on TikTok’s content and on Google Cloud Video Intelligence and Vision APIs to extract visual and OCR-based signals, which may not generalize to other platforms, toolchains, or future versions of these services. This dependence complicates full reproducibility and may introduce biases related to how commercial systems detect and label visual entities in migration contexts.
Finally, the study focuses on a binary classification task (migration-related vs. non-related) and does not address finer-grained distinctions such as differentiating between informational, promotional, or testimonial content about irregular migration. It also does not explicitly model potential harms associated with misclassification, such as false positives that could affect migrants’ expression or false negatives that might hinder risk monitoring. Addressing these issues will require larger datasets, more nuanced annotation schemes, and closer collaboration with domain experts, civil society organizations, and affected communities.
We acknowledge the potential of Large Language Models (LLMs) such as GPT-4o [31] or Llama 3 [15] for multimodal understanding. Rather than viewing them as replacements for feature engineering, we position LLMs as complementary weak supervision oracles within a student–teacher paradigm. In future work, LLMs could be used to generate noisy, zero-shot labels for large corpora of unlabeled TikTok videos, serving as a silver-standard signal for training lightweight, interpretable fusion models. This approach would enable scaling from 50 to thousands of videos while preserving the privacy, speed, and auditability advantages of the proposed feature-based pipeline.
Interpretability and auditability are central to this work. Our fusion approach explicitly identifies which signals (e.g., specific OCR keywords or regional indicators) trigger a classification, offering a level of transparency that “black-box” LLMs currently lack. This is critical for avoiding biases against vulnerable populations.
Beyond interpretability, our feature-fusion approach offers potential advantages in privacy and latency compared to end-to-end Large Multimodal Models (LMMs). Processing video content through external LMM APIs (e.g., GPT-4o) typically incurs high latency and requires transmitting sensitive visual data (faces of vulnerable individuals) to third-party cloud servers. In contrast, our classifiers operate on lightweight feature vectors rather than raw video. The inference time for our XGBoost model is negligible (on the order of milliseconds) once features are extracted, which would enable real-time monitoring on edge devices. Because feature extraction in the current pipeline still relies on Google Cloud APIs (Section 4.4), the approach becomes privacy-preserving by design only when extraction is performed locally, for example with the open-source alternatives discussed there; under that condition it avoids exposing sensitive biometric data to external providers and is better aligned with strict data protection requirements.
The feature schema developed in this study can underpin a weakly supervised labeling pipeline for larger-scale datasets. For example, simple labeling functions could flag candidate migration-related videos when combinations of OCR-derived route or boat terms and regional indicators (e.g., European or Turkish coastal regions) are present, possibly together with maritime visual labels such as hasLbl_boat or hasLbl_sea. These heuristic labels could then be combined and denoised using existing weak supervision frameworks [32], and iteratively refined with human validation on a smaller subset of high-uncertainty cases.
Future Directions: Towards Scalable, Robust Migration Monitoring
The interpretable multimodal framework presented in this study serves as a methodological foundation for several research and deployment pathways that address the limitations identified in Section 4.4 and Section 4.5 while scaling to real-world monitoring scenarios.
Expanding Dataset Scale and Diversity: The pilot dataset (N = 50) provides proof-of-concept for feature-level fusion, but production deployment requires validation on larger, more diverse corpora. Immediate next steps include:
Temporally separated validation set: Collect N = 200–500 videos from 2025–2026 (post-pilot period) to assess whether the identified features (hasOCR_route, has_RegionEU) retain discriminative power as migration routes, platform policies, and creator behaviors evolve. This out-of-time validation would reveal feature degradation rates and inform adaptive updating schedules.
Geographic expansion: Extend regional coverage beyond Mediterranean routes to include Sahel–Libya–Italy corridors, U.S.–Mexico border crossings, and Channel (France–UK) routes. This requires building region-specific keyword dictionaries in collaboration with migration researchers, NGOs, and native speakers of Arabic dialects, Spanish, French, and other relevant languages.
Multi-annotator reliability study: Recruit 3–5 independent annotators (including domain experts and native speakers) to label a stratified sample of N ≥ 100 videos, compute formal inter-rater agreement (Krippendorff’s alpha across all annotators, or Cohen’s kappa for annotator pairs), and refine the annotation protocol based on disagreement analysis. This would strengthen the validity of the ground truth and enable more robust performance claims.
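For the reliability study, pairwise agreement can be computed with Cohen's kappa, which corrects observed agreement for the agreement expected by chance. The sketch below is a minimal two-annotator implementation with toy labels; with more than two annotators it would be applied to each pair, or replaced by Krippendorff's alpha.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label independently.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for six videos (1 = migration-related, 0 = non-related).
annotator_1 = [1, 1, 0, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on four of six videos (observed agreement 0.67), but chance agreement is 0.50, so kappa is only about 0.33.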
Active Learning and Human-in-the-Loop Refinement: Rather than passively collecting labeled data, we propose an active learning loop that leverages the current model’s uncertainty to maximize annotation efficiency:
- Deploy the Fusion LR model to score unlabeled TikTok videos in real time, computing prediction probabilities P(y = 1) for each video.
- Identify uncertain samples (0.4 < P(y = 1) < 0.6) where the model is least confident. These borderline cases (e.g., tourism content with maritime elements, news clips with migration keywords) are precisely the examples that would most improve the decision boundary if labeled.
- Prioritize these uncertain videos for manual annotation, updating the training set incrementally. Simulation studies suggest that active learning can reduce annotation costs by roughly 40–60% compared to random sampling while achieving equivalent performance, although such savings are task-dependent.
This human-in-the-loop approach also enables continuous adaptation to adversarial evasion: as smugglers and creators adopt new coded language or visual tactics, the model flags novel patterns for human review, and verified examples are fed back into training.
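The selection step of this loop can be sketched in a few lines. The function takes model probabilities keyed by video id, keeps those strictly inside the uncertainty band given above (0.4 < P(y = 1) < 0.6), and orders them so that the most ambiguous videos are annotated first. The function name and input format are illustrative assumptions.

```python
from typing import Dict, List


def select_uncertain(
    scores: Dict[str, float], low: float = 0.4, high: float = 0.6
) -> List[str]:
    """Return video ids with low < P(y = 1) < high, most uncertain first."""
    borderline = [(vid, prob) for vid, prob in scores.items() if low < prob < high]
    borderline.sort(key=lambda item: abs(item[1] - 0.5))
    return [vid for vid, _ in borderline]


# Toy probabilities for five unlabeled videos.
queue = select_uncertain({"a": 0.95, "b": 0.55, "c": 0.41, "d": 0.50, "e": 0.40})
print(queue)  # ['d', 'b', 'c']
```

Videos scored exactly at the band edges (such as "e" at 0.40) are excluded because the bounds are strict, matching the inequality in the text.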
Adversarial Robustness and Evasion Detection: Migration-related content on TikTok operates in an adversarial environment where creators intentionally evade moderation. Future work should systematically study and counter evasion tactics:
- Linguistic evasion: Test the framework’s robustness to character substitutions (e.g., “b0at” instead of “boat”), emoji replacements (🚢 for “ship”), and code-switching (mixing Arabic, English, and emoji in the same caption). Develop character-level n-gram features or fuzzy keyword matching to capture obfuscated terms.
- Visual evasion: Analyze how static images with burned-in text (the hasLbl_song pattern identified in Section 4.4) circumvent video-based moderation. Investigate whether such content exhibits distinctive patterns in frame variance, motion scores, or audio spectrograms that could be exploited as counter-evasion features.
- Platform adaptation monitoring: Track changes in TikTok’s content moderation policies and recommendation algorithms, assessing their impact on migration content visibility and format choices. This sociotechnical analysis would inform feature re-validation schedules and guide deployment strategies.
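One simple way to approach the linguistic evasion cases is to normalize common character substitutions and emoji before matching, and then apply fuzzy matching to tolerate residual misspellings. The sketch below uses only the Python standard library; the substitution table, the emoji mapping, and the 0.8 similarity cutoff are illustrative assumptions, not validated settings.

```python
import difflib
from typing import Iterable, Set

# Hypothetical substitution table for common look-alike characters.
SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)
# Hypothetical emoji-to-term mapping.
EMOJI_TERMS = {"🚢": " ship ", "🌊": " sea ", "⛵": " boat "}


def normalize(text: str) -> str:
    """Replace mapped emoji with words, lowercase, and undo look-alike substitutions."""
    for emoji, term in EMOJI_TERMS.items():
        text = text.replace(emoji, term)
    return text.lower().translate(SUBSTITUTIONS)


def fuzzy_hits(text: str, keywords: Iterable[str], cutoff: float = 0.8) -> Set[str]:
    """Return the keywords that approximately match any token of the normalized text."""
    keyword_list = list(keywords)
    hits: Set[str] = set()
    for token in normalize(text).split():
        hits.update(difflib.get_close_matches(token, keyword_list, n=1, cutoff=cutoff))
    return hits


hits = fuzzy_hits("b0at to r0ute 🚢", ["boat", "route", "ship"])
```

A known limitation of this sketch is that digit substitution also rewrites genuine numbers (for example, years), so a production version would need to restrict normalization to tokens that mix letters and digits.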
Cross-Platform Generalization: While this study focuses on TikTok, irregular migration content appears across multiple platforms (YouTube Shorts, Instagram Reels, Telegram, Snapchat). We hypothesize that the interpretable multimodal features identified here (OCR route terms, regional indicators, maritime visual labels) may transfer to other short-form video platforms with minimal adaptation:
- Conduct a pilot transfer study: Apply the trained model to N = 100 videos from YouTube Shorts and Instagram Reels (same time period, same migration routes) without retraining. Measure performance degradation and identify platform-specific feature gaps.
- If the drop in F1 on these platforms remains below 0.10 relative to TikTok performance, this would support the generality of the feature set and enable cross-platform monitoring with a unified pipeline.
- Platform-specific features (e.g., Instagram’s geotag structure, YouTube’s chapter markers) could be incorporated as optional modalities without redesigning the core framework.
Reducing API Dependency via Open-Source Pipelines: As discussed in Section 4.4, the current reliance on Google Cloud APIs poses cost and vendor lock-in risks. A high-priority direction is benchmarking open-source alternatives:
- Replace Google Video Intelligence with CLIP (Contrastive Language-Image Pretraining) for zero-shot visual classification. Extract keyframe embeddings (1–3 frames per video) and compute cosine similarity to text prompts such as “crowded boat at sea,” “people on rubber dinghy,” and “maritime rescue operation.”
- Replace Google Vision OCR with Tesseract or EasyOCR for Arabic and Greek script extraction, combined with mT5 or LaBSE (Language-agnostic BERT Sentence Embedding) for multilingual text normalization.
- Conduct a cost-performance analysis: If open-source features achieve F1 ≥ 0.85 (within 0.07 of API-based F1 = 0.92), the framework could be deployed at scale without commercial API fees, enabling broader adoption by NGOs, researchers, and civil society organizations with limited budgets.
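The zero-shot scoring step proposed for CLIP can be sketched as follows. The sketch assumes that keyframe and prompt embeddings have already been produced by a CLIP image and text encoder, and the three-dimensional vectors below are placeholders chosen for readability. Taking the maximum similarity over keyframes is an illustrative pooling choice, not a method from this study, and score_video is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity to anything
    return dot / (norm_a * norm_b)


def score_video(frame_embeddings, prompt_embeddings):
    """Return one score per text prompt: the best match over all keyframes.

    frame_embeddings: list of vectors, one per keyframe (1-3 per video in the text).
    prompt_embeddings: dict mapping prompt text to its embedding vector.
    """
    return {
        prompt: max(cosine_similarity(frame, emb) for frame in frame_embeddings)
        for prompt, emb in prompt_embeddings.items()
    }


# Placeholder embeddings; real ones would come from a CLIP encoder.
prompts = {
    "crowded boat at sea": [1.0, 0.0, 0.0],
    "maritime rescue operation": [0.0, 1.0, 0.0],
}
keyframes = [[0.6, 0.8, 0.0], [1.0, 0.0, 0.0]]
scores = score_video(keyframes, prompts)
print(scores)
```

The resulting per-prompt scores could stand in for the API-derived visual labels as input features, which keeps the downstream classifiers and SHAP analysis unchanged.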
Ethical Deployment and Stakeholder Engagement: Any real-world deployment of migration content detection must address ethical risks:
- False positives and moderation harms: Automated flagging of migration content could lead to censorship of migrant voices, humanitarian advocacy, or journalism. Future work should explore purpose-aware classifiers that distinguish between smuggling advertisements (potentially harmful) and migrant documentation/solidarity content (potentially valuable).
- Stakeholder co-design: Engage migrants, NGOs, and platform moderators in the design and validation of monitoring tools, ensuring that detection systems serve humanitarian goals (e.g., disrupting smuggling networks, identifying distress signals) rather than enabling indiscriminate content removal.
- Transparency and contestability: Publish the feature schema, keyword lists, and SHAP-based explanations as open resources, allowing affected communities to understand and challenge automated decisions.
This pilot study demonstrates that interpretable multimodal feature fusion can detect migration-related content under extreme data scarcity (N = 50). Translating this proof-of-concept into scalable, robust, and ethical monitoring infrastructure, however, requires systematic work across data diversity, adversarial robustness, cross-platform generalization, and stakeholder engagement. The feature-based design explicitly facilitates these extensions: new keywords, regions, or visual labels can be added without architectural redesign. This positions the framework as an incremental, adaptive foundation for ongoing research in this sensitive and rapidly evolving domain.
6. Conclusions
This paper presents a small-scale multimodal study of migration-related content detection on TikTok, using interpretable feature fusion over video labels, OCR text, and metadata-derived indicators. By systematically comparing Text, OCR, Vision, and Fusion configurations with LR, RF, and XGB classifiers on a manually annotated set of 50 videos, the study shows that multimodal fusion consistently outperforms single-modality baselines, with F1-scores of up to 0.92 on the migration-related class. Feature importance and SHAP analyses further highlight the central role of OCR-derived keywords, maritime labels, and regional flags in capturing migration narratives that may not be visible in captions alone.
Beyond reporting performance numbers, the work argues for the importance of interpretable multimodal feature design in sensitive, data-scarce domains such as irregular migration. The feature schema and empirical insights outlined here can inform future efforts to build larger, more diverse datasets and to design weakly supervised labeling strategies that combine OCR, visual, and regional cues at scale. In doing so, they can support researchers and practitioners who monitor risky journeys and public narratives about migration on short-video platforms, while maintaining a clear focus on transparency, reproducibility, and ethical considerations.
The primary contribution of this work therefore lies in isolating a compact and interpretable set of multimodal cues that can serve as reliable anchors for weakly supervised labeling at scale. In this sense, the study bridges exploratory multimodal analysis with practical dataset construction, offering a pathway for expanding migration-related datasets while maintaining transparency and expert oversight.