1. Introduction
Visual surveillance is a foundational component of smart city infrastructure, supporting real-time applications such as traffic management and anomaly detection. Surveillance cameras operate under diverse configurations, varying in viewpoint, angle, elevation, and location, as illustrated in
Figure 1. This diversity makes it infeasible to rely on a single machine learning model that performs well across all environments. Instead, prior work has shown that camera-specific models are more effective and efficient in such settings [
1,
2].
Training a unique model from scratch for each surveillance camera is often infeasible due to the high computational cost and the vast number of devices. For example, training a single model can require eight hours on a Tesla V100 GPU (NVIDIA, Santa Clara, CA, USA), and a city such as London has over 940,000 cameras [
3]. An alternative strategy is to design lightweight models tailored for a specific task while reducing computational requirements [
4]; such specialized models may not be readily available for all target tasks, particularly when the available data are unannotated. Consequently, a common strategy to obtain these location-specific models is to fine-tune pretrained models [
5]. However, this approach is sensitive to domain shift: if the source and target domains differ significantly, model performance can degrade substantially [
6]. The selection process itself is further complicated by the unreliability of reported performance metrics. For instance, a study on real-world systems found a discrepancy as high as 44% between a model’s claimed accuracy and its measured operational performance [
7]. These factors highlight the critical need for methods that can estimate a model’s effectiveness on specific target data prior to deployment.
Estimating the transferability of a pretrained model to a new environment is inherently challenging as it involves quantifying the distributional shift between source and target data. In practical scenarios such as smart surveillance, access to the original training data is often restricted due to privacy concerns. While the European Union promotes increased data sharing across industrial and public sectors to drive innovation [
8], the direct exchange of raw data often conflicts with GDPR [
9] or risks exposing proprietary information. One solution to prevent data violations is to limit access to and encrypt the stored data [
10]. An alternative, the source-free setting, shares trained models instead of data and is increasingly regarded as a privacy-preserving way to enable collaboration without disclosing sensitive datasets. In addition, no labeled data are typically available for the target domain. This source-free, unsupervised setting [
11] significantly complicates transferability assessment, as it precludes direct comparison between source and target domains and limits the use of traditional domain adaptation techniques.
Figure 2 further illustrates this concept of source-free unsupervised domain adaptation (SFUDA) in smart surveillance settings. First, a model zoo is formed of models pretrained on data collected under diverse configurations. These models could have different architectures, training data, or training procedures. When a new camera is installed in the system, a configuration procedure finds the model from the model zoo that best fits the new scene. For this, each of the models processes a small amount of camera data, and the transferability assessment is used to rank them, identifying the pretrained model that is most adaptable to the target domain. Finally, the highest-ranked model is deployed either directly or after further fine-tuning on target data for the specific camera. Crucially, this procedure requires no access to the original training data of the models in the model zoo, nor does it require labeled information for the target task that will be performed on the new data (i.e., it operates in the source-free unsupervised setting).
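As a rough sketch of this configuration procedure (with hypothetical helper names such as `model_zoo`, `score_fn`, and `target_frames`; the transferability score itself is defined in Section 3), the selection step amounts to scoring each zoo model on a small batch of unlabeled target frames and keeping the best one:

```python
# Minimal sketch of the configuration procedure: rank the pretrained models in
# the zoo on a small batch of unlabeled target frames and keep the best one.
# Helper names (model_zoo, score_fn, target_frames) are illustrative.
import torch

def select_model(model_zoo, target_frames, score_fn):
    """Return the name of the highest-scoring model and all scores."""
    scores = {}
    with torch.no_grad():
        for name, model in model_zoo.items():
            model.eval()
            embedding = model(target_frames)        # source-free: only the model is shared
            scores[name] = float(score_fn(embedding))
    best = max(scores, key=scores.get)
    return best, scores
```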
We propose a novel approach to transferability assessment based on randomly initialized neural networks (RINNs). Unlike existing methods that rely on assumptions such as reliable model uncertainty or the generalizability of pretrained embedding spaces, assumptions which often fail in real-world surveillance scenarios, our method leverages RINNs to construct a task-agnostic, data-independent reference embedding space. Transferability is then estimated by computing the similarity between embeddings produced by each pretrained model and those from the RINN ensemble, using minibatch-Centered Kernel Alignment (CKA) for scalability and efficiency. Despite lacking training, RINNs have been shown to capture meaningful structural priors through architectural depth and compositional nonlinearity [
12,
13]. This enables robust model selection without requiring access to source data, target labels, or unreliable uncertainty estimates. In the context of practical applications, our framework is designed for an offline model selection phase, which would typically be executed on a server or cloud infrastructure. Following this one-time assessment, the single best-identified model could then be further fine-tuned for a specific target if needed, and subsequently deployed for efficient, real-time inference on resource-constrained edge devices at the surveillance site.
Through experimental evaluations on four unlabeled source video datasets (Tokyo Intersection [
14], Tokyo Street [
15], and Agdao-Market/Street [
16]) and three public target sets (Street Scene [
17], NWPU Campus [
18], and Urban Tracker [
19]), we demonstrate that this is a cost-efficient solution for source-free, unsupervised domain adaptation in smart-city environments.
To summarize, this research confronts the critical challenge of assessing model transferability under source-free and unsupervised conditions, a problem of practical importance for deploying machine learning models in real-world scenarios such as smart city surveillance. To the best of our knowledge, this specific challenge remains underexplored, with very few prior works addressing it directly. Our main contributions to this area are as follows:
We introduce a novel transferability assessment framework designed for the aforementioned challenging conditions, which effectively addresses the diverse camera configurations and significant domain shifts inherent in smart surveillance applications.
The proposed framework uniquely employs ensembles of randomly initialized neural networks (RINNs) to create a task-agnostic reference embedding space. This approach avoids biases inherent in using pretrained models for the reference itself and enables model assessment without prior knowledge of the specific downstream task or objective.
We develop an embedding-level score, $S_E$, that compares structural similarities in data representations, reflecting intrinsic data characteristics. This approach yields more robust, task-agnostic transferability estimates, avoiding the instabilities tied to pseudo label-based methods and the reliance on task similarity, particularly under shifting downstream objectives.
Comprehensive empirical validation on three public surveillance datasets (Street Scene, NWPU Campus, and Urban Tracker, covering 48 diverse scenes) and three downstream tasks (object tagging, anomaly detection, and event classification) demonstrates our metric’s efficacy. The results confirm that $S_E$ consistently identifies the most transferable models, proving its practical utility for selecting the most suitable model for adaptation within these challenging real-world surveillance scenarios.
The remainder of the paper is structured as follows.
Section 2 introduces past related work on transferability assessment, randomly initialized neural networks, embedding similarity and source-free unsupervised domain adaptation. In
Section 3, we describe the proposed framework, the construction of the reference embedding space using RINNs, and the definitions of the two novel transferability scores. The evaluation process, including the details of the model zoo, datasets, downstream tasks, and implementation specifics, is also explained in
Section 3. Then, the experimental results and ablation study are presented and discussed in
Section 4. Finally, we conclude the findings and discuss future research directions in
Section 5.
3. Materials and Methods
In this section, we present our novel transferability assessment framework designed for source-free unsupervised domain adaptation. We first introduce the overall structure and key components of the framework. We then present the two proposed RINN-based assessments, the label-level score $S_L$ and the embedding-level score $S_E$. Finally, we describe the benchmark datasets and models that will be used in the experimental evaluation.
This proposed framework operates on video frames obtained from the target environment. While the characteristics of this data are inevitably influenced by the camera hardware, environmental conditions including lighting, and any preliminary preprocessing, the transferability assessment method itself is designed to be independent of these specific underlying hardware components, focusing instead on the data as presented.
3.1. Framework
Our proposed framework is illustrated in
Figure 3. Let $\Phi = \{\phi_1, \ldots, \phi_N\}$ denote a model zoo of $N$ pretrained models and let $X$ be the unlabeled target data. Each model $\phi_i$ encodes $X$ into a representation $E_i = \phi_i(X)$; these representations are then used to evaluate the transferability of each model.
In addition, we construct an RINN zoo based on a set of $M$ randomly initialized neural networks $\{\rho_1, \ldots, \rho_M\}$, each producing a representation $R_j = \rho_j(X)$. Given one pretrained representation $E_i$ and the collection $\{R_1, \ldots, R_M\}$, we define a transferability score $S(\phi_i)$, which measures how much of the information captured by the RINNs is already present in $E_i$. A higher score implies that $\phi_i$ is more adaptable to the target domain. We aim to rank the models $\phi_1, \ldots, \phi_N$ based on their adaptability to $X$ without requiring source data or target annotations. We compare two approaches of using the RINNs for transferability estimation: label-level scoring and embedding-level scoring, as explained in the following paragraphs.
3.1.1. Label-Level Score
In this approach, we use the RINNs to directly generate pseudo labels $\hat{y}_j$ for each input $X$: $\hat{y}_j = h_j(\rho_j(X))$, where $h_j$ is a simple, randomly initialized classifier added on top of $\rho_j$. This layer predicts a pseudo label for each input. The predicted pseudo label has no semantic meaning but should instead be seen as an assignment to a cluster grouping similar inputs. In our experiments, we set each $h_j$ to predict 20 classes. We then add a similar fully-connected layer $g_{i,j}$ on top of each pretrained model $\phi_i$ and fine-tune this layer using gradient descent to predict the pseudo label. The underlying intuition for this approach is that predicting the pseudo label is only possible if the pretrained model has a rich enough feature representation, capturing broad information that covers the features extracted by a large set of RINNs. The transferability score is then the sum over all the RINNs $\rho_j$:

$$S_L(\phi_i) = \sum_{j=1}^{M} P\big(g_{i,j}(E_i),\, \hat{y}_j\big), \tag{1}$$

where $P$ is a performance metric such as accuracy, F1-score, or MSE.
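For illustration, a minimal PyTorch sketch of how $S_L$ could be computed is shown below; the optimizer, the number of epochs, and the use of accuracy as the metric $P$ are illustrative assumptions rather than the exact training recipe (which is described in Section 3.2).

```python
# Hedged sketch of the label-level score S_L (Equation (1)); hyperparameters
# and the accuracy metric are illustrative assumptions.
import torch
import torch.nn as nn

def label_level_score(feats, rinn_pseudo_labels, num_classes=20,
                      epochs=50, lr=1e-3):
    """feats: (n, d) frozen features of one pretrained model on target data X.
    rinn_pseudo_labels: list of (n,) pseudo-label tensors, one per RINN."""
    score = 0.0
    for labels in rinn_pseudo_labels:
        head = nn.Linear(feats.shape[1], num_classes)   # layer g fine-tuned per RINN
        opt = torch.optim.Adam(head.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.cross_entropy(head(feats), labels)
            loss.backward()
            opt.step()
        with torch.no_grad():
            acc = (head(feats).argmax(dim=1) == labels).float().mean().item()
        score += acc                                     # P = accuracy here
    return score
```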
3.1.2. Embedding-Level Score
A disadvantage of the label-level score is that it requires a training step, which can be costly. In addition, the pseudo labels sometimes collapse, with a RINN predicting the same output for every input. A large set of RINNs is needed to avoid this issue.
An alternative scoring mechanism uses internal representations of the RINNs and pretrained models. Based on knowledge distillation approaches, we can use metrics such as correlation-based similarity [
40] or Centered Kernel Alignment (CKA) to directly measure the similarity between the two internal representations. The embedding-level score is defined as:

$$S_E(\phi_i) = \frac{1}{M}\sum_{j=1}^{M} \mathrm{sim}\big(E_i,\, R_j\big), \tag{2}$$

where $\mathrm{sim}(\cdot, \cdot)$ is a similarity measure between two representations.
In this paper, we use minibatch-CKA, proposed in [35], to measure the structural similarity between $E_i$ and $R_j$. We adopt minibatch-CKA not only for its alignment with representation-based analysis but also for its practical benefits: it enables scalable evaluation by computing similarity over mini-batches, thereby significantly reducing memory usage compared to standard CKA. To calculate minibatch-CKA, $X$ is first divided into $n_b$ batches of $B$ samples, $X = \{X_1, \ldots, X_{n_b}\}$. By applying pretrained model $\phi_i$ and randomly initialized model $\rho_j$ to a batch $X_b$, we obtain two representations $E_{i,b} \in \mathbb{R}^{B \times d_E}$ and $R_{j,b} \in \mathbb{R}^{B \times d_R}$. Note that $d_E$ and $d_R$ represent the numbers of dimensions of $E_{i,b}$ and $R_{j,b}$, which are not necessarily the same. As minibatch-CKA uses a linear kernel to calculate CKA, we further define the kernel matrices $K_b = E_{i,b} E_{i,b}^{\top}$ and $L_b = R_{j,b} R_{j,b}^{\top}$. The minibatch-CKA is then calculated using the following equation:

$$\mathrm{CKA}_{\mathrm{mb}}(E_i, R_j) = \frac{\frac{1}{n_b}\sum_{b=1}^{n_b} \mathrm{HSIC}_1(K_b, L_b)}{\sqrt{\frac{1}{n_b}\sum_{b=1}^{n_b} \mathrm{HSIC}_1(K_b, K_b)}\,\sqrt{\frac{1}{n_b}\sum_{b=1}^{n_b} \mathrm{HSIC}_1(L_b, L_b)}}, \tag{3}$$

where

$$\mathrm{HSIC}_1(K, L) = \frac{1}{B(B-3)}\left[\mathrm{tr}(\tilde{K}\tilde{L}) + \frac{\mathbf{1}^{\top}\tilde{K}\mathbf{1}\,\mathbf{1}^{\top}\tilde{L}\mathbf{1}}{(B-1)(B-2)} - \frac{2}{B-2}\,\mathbf{1}^{\top}\tilde{K}\tilde{L}\mathbf{1}\right],$$

with $\tilde{K}$ and $\tilde{L}$ obtained by setting the diagonals of $K$ and $L$ to zero. In Equation (3), HSIC is the Hilbert–Schmidt Independence Criterion [41], which measures the dependence between two sets of variables. To make the HSIC estimate independent of $B$ so that CKA can be calculated batch-by-batch, minibatch-CKA uses the unbiased HSIC estimator ($\mathrm{HSIC}_1$) proposed in [36]. For detailed derivation and proof, please refer to [33]. With minibatch-CKA, Equation (2) becomes $S_E(\phi_i) = \frac{1}{M}\sum_{j=1}^{M} \mathrm{CKA}_{\mathrm{mb}}(E_i, R_j)$.
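The following sketch shows one way to compute $\mathrm{CKA}_{\mathrm{mb}}$ with the unbiased HSIC estimator in PyTorch; variable names are ours, not those of the authors' implementation, and $S_E$ is then obtained by averaging this value over the RINN zoo.

```python
# Sketch of minibatch-CKA with the unbiased HSIC estimator (requires B > 3).
import torch

def hsic1(K, L):
    """Unbiased HSIC estimator on Gram matrices K, L of shape (B, B)."""
    B = K.shape[0]
    K = K.clone().fill_diagonal_(0)                 # K-tilde
    L = L.clone().fill_diagonal_(0)                 # L-tilde
    ones = torch.ones(B, 1, dtype=K.dtype, device=K.device)
    trace = torch.trace(K @ L)
    mid = (ones.T @ K @ ones) * (ones.T @ L @ ones) / ((B - 1) * (B - 2))
    right = 2.0 / (B - 2) * (ones.T @ K @ L @ ones)
    return ((trace + mid - right) / (B * (B - 3))).squeeze()

def minibatch_cka(feats_a, feats_b):
    """feats_a, feats_b: lists of per-batch features of shape (B, d_E) and (B, d_R).
    The 1/n_b factors in Equation (3) cancel between numerator and denominator."""
    xy, xx, yy = 0.0, 0.0, 0.0
    for A, R in zip(feats_a, feats_b):
        K, L = A @ A.T, R @ R.T                     # linear-kernel Gram matrices
        xy += hsic1(K, L)
        xx += hsic1(K, K)
        yy += hsic1(L, L)
    return xy / (xx.sqrt() * yy.sqrt())
```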
3.2. Models
We construct a model zoo comprising models with identical architectures but trained on separate source videos. This design isolates data-induced variations in learned representations while controlling for architectural differences. All models use the ASTNet backbone [
42] and follow the training configuration for ShanghaiTech, as described in the original paper. In addition to the configuration-specific models, we include a universal model trained on the combined source data from all configurations. This allows us to evaluate whether a universal model can match or exceed the performance of specialized models under identical resource constraints. For evaluation, all models from the zoo were used off-the-shelf, kept frozen, and uniformly applied to each target dataset without task-specific fine-tuning. Each model encodes the target data into feature representations, which are then used to perform transferability assessments. These assessments rank the models by their adaptability to the target domain. Importantly, in alignment with real-world surveillance constraints, the source data is assumed to be inaccessible during the transferability evaluation process.
The RINN zoo is instantiated with $M = 20$ networks, a value chosen to balance computational cost and ranking stability, as confirmed by the ablation study in
Section 4.4. Each RINN adopts the X3D-M backbone [43] from PyTorchVideo (v0.1.5) [44], initialized with randomly sampled weights instead of pretrained checkpoints. For the label-level score $S_L$, we reduce X3D’s output layer to 20 classes. Following [20], we train a single fully connected layer to map the pretrained model’s representation $E_i$ to the pseudo labels generated by the RINN, $\hat{y}_j$. This layer was trained using standard procedures: the learning rate was set using a learning rate range test, and an early stopping criterion on a held-out validation set determined the number of epochs. An exhaustive hyperparameter search was not conducted for this component, as the $S_L$ score’s role is to serve as a baseline demonstrating the limitations of classifier-dependent methods. This instability provides the core motivation for our main contribution, the $S_E$ score, which is independent of such a classifier. For the embedding-level score $S_E$, which directly compares feature embeddings, we remove the final classification layer of X3D. Similarity is computed using minibatch-CKA, adapted from the open-source PyTorch (v2.4.0) implementation by Ristori [45]. The resulting RINN zoo is reused across all target datasets without any fine-tuning. Consistent with real-world surveillance constraints, no target annotations are accessed during the transferability assessment.
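As a rough illustration, the RINN zoo could be instantiated as follows; loading X3D-M through `torch.hub` and stripping the classification head by replacing the head's projection layer are our assumptions, not necessarily the exact PyTorchVideo calls used in our implementation.

```python
# Sketch of building the RINN zoo with randomly initialized X3D-M backbones.
import torch
import torch.nn as nn

def build_rinn_zoo(m=20, seed=0):
    """Build m randomly initialized X3D-M networks (no pretrained weights)."""
    zoo = []
    for j in range(m):
        torch.manual_seed(seed + j)   # a different random initialization per RINN
        rinn = torch.hub.load('facebookresearch/pytorchvideo',
                              'x3d_m', pretrained=False)
        # For the embedding-level score S_E we drop the classifier; replacing the
        # head's projection with Identity is an assumed attribute path.
        rinn.blocks[-1].proj = nn.Identity()
        rinn.eval()
        zoo.append(rinn)
    return zoo
```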
3.3. Datasets
The source data used to train the models in the pretrained model zoo consist of four real-world surveillance videos, captured by cameras positioned at distinct locations with varying viewpoints. These video streams, sourced from live surveillance feeds in Tokyo and Agdao [
14,
15,
16], are illustrated in
Figure 4. In Tokyo, one camera captures a large urban intersection (
Figure 4a), while the other monitors a narrow alley lined with bars and restaurants (
Figure 4b). In Agdao, the Agdao-Market camera provides a close-up view of a bustling roadside market (
Figure 4c), whereas Agdao-Street captures a quieter residential street scene (
Figure 4d). Temporally, while Tokyo Street [
15] covers a short period of night time, the other three datasets are recorded during the daytime. Context-wise, the urban and market scenes (Tokyo Intersection [
14], Tokyo Street [
15], and Agdao Market [
16]) are characterized by high object density and complex layouts, leading to severe and frequent occlusions. Viewpoints range from high-angle, which reduces object scale, to eye-level, where scale can vary dynamically.
For the target data, we use three publicly available datasets: Street Scene [
17], NWPU Campus [
18], and Urban Tracker [
19]. Street Scene and NWPU Campus are designed for anomaly detection. Street Scene contains videos captured from a fixed camera viewpoint, is composed of short, discontinuous clips, and includes 205 annotated anomaly events spanning 17 categories. NWPU Campus includes footage from 43 different cameras, covering 28 distinct anomaly classes. Based on these annotations, both datasets are used for evaluating anomaly detection and event classification tasks. For anomaly detection, we adopt frame-level AUC as the evaluation metric, while event classification is evaluated using instance-level mAP. Since the original annotations only indicate abnormal frames, we manually map each target video into one of the 17 or 28 predefined anomaly categories described in the respective papers. Urban Tracker, originally developed for object tracking, comprises five video sequences. We utilize four of them, specifically
Sherbrooke,
Rouen,
St-Marc, and
Atrium, and exclude
René-Lévesque due to excessive downscaling that renders objects too small for reliable detection. As the dataset does not provide a predefined train–test split, we follow the protocol in [
46], assigning the first 80% of frames for training and the remaining 20% for testing. Since Urban Tracker provides object-level descriptions but not formal semantic classes, we categorize objects based on their free-text descriptions. For object tagging, we evaluate performance using frame-level mean average precision (mAP). Evaluations across these datasets are conducted in accordance with the type of annotation available. All three target datasets were recorded during the daytime and originate from geographic regions (China, North America/Europe) distinct from the source domains. A summary of the datasets used in our experiments is presented in
Table 1. Further details can be found in their respective original publications.
3.4. Baseline Models and Evaluation Protocols
Very few methods have been developed for transferability assessment under the source-free, label-free setting. Most existing approaches focus on direct domain adaptation rather than explicitly evaluating transferability. In this work, we primarily compare our two proposed transferability metrics, $S_L$ and $S_E$, against Uncertainty Distance (UD) [26] and Ensemble-based Validation (EnsV) [27], a state-of-the-art method in model selection. Note that EnsV [27] relies on pretrained classifiers to generate the proxy ground truth for model ranking. Since our source models are feature extractors, they cannot be used with EnsV directly. Thus, to adapt EnsV as a baseline, we implement a two-step process. For each source model, we first apply a K-Means clustering algorithm to its output embeddings to generate a set of $k$ distinct pseudo-classes. Subsequently, the label assigned to each embedding is determined by its nearest cluster centroid. To ensure a fair and controlled comparison against our own label-level baseline ($S_L$), we set the number of clusters to $k = 20$, matching the output dimensionality of the classifier used for $S_L$.
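A minimal sketch of this two-step adaptation, assuming scikit-learn's KMeans and source-model embeddings stored as a NumPy array, is given below.

```python
# Sketch of the EnsV adaptation: cluster one source model's embeddings into
# k = 20 pseudo-classes, then label each embedding by its nearest centroid.
import numpy as np
from sklearn.cluster import KMeans

def pseudo_classify(embeddings: np.ndarray, k: int = 20, seed: int = 0):
    """embeddings: (n, d) features of one source model on the target data."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)
    pseudo_labels = km.predict(embeddings)   # nearest-centroid assignment
    return pseudo_labels, km.cluster_centers_
```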
For object tagging and event classification, we attach a fully connected layer on top of each frozen pretrained model, following the protocol in [20]. This layer is fine-tuned on the training set of the target domain, and task accuracy is used as the ground truth to assess the quality of the transferability rankings. To validate whether the estimated model transferability aligns with actual model performance, we report Kendall’s $\tau$ score [47]. Kendall’s $\tau$ ranges from $-1$ (complete disagreement) to 1 (perfect agreement), with 0 indicating no correlation, and quantifies the alignment between estimated and ground-truth rankings. It is worth noting that, given only five models, Kendall’s $\tau$ is highly sensitive: even a single position shift can noticeably affect the score. Similarly, for anomaly detection, we follow the evaluation protocol in [20], using Kendall’s $\tau$ to measure the correlation between estimated rankings and actual performance.
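For clarity, the ranking check reduces to a call to `scipy.stats.kendalltau` on the estimated and measured values, as in the toy example below (the numbers are illustrative, not results from Table 2).

```python
# Minimal example: correlate estimated transferability scores with measured
# task accuracy using Kendall's tau.
from scipy.stats import kendalltau

estimated = [0.62, 0.48, 0.71, 0.55, 0.40]   # transferability scores (illustrative)
measured  = [0.81, 0.74, 0.86, 0.78, 0.70]   # ground-truth task accuracy (illustrative)
tau, p_value = kendalltau(estimated, measured)
print(f"Kendall's tau = {tau:.2f}")           # 1.0 here: the two orderings agree exactly
```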
4. Results and Discussion
In the following sections, we evaluate our transferability estimation approach on the tasks of anomaly detection, object tagging, and event detection. The performance of our method is compared against other baselines (UD, EnsV), with the results visualized in
Figure 5 and the precise numerical scores reported in
Table 2. We then perform an ablation study to investigate the trade-off between transferability prediction performance and computational cost by varying the number of RINNs in our approach.
4.1. Object Tagging
We evaluate performance on the object tagging task using the Urban Tracker dataset, which includes four diverse camera scenes. As shown in
Table 2, $S_E$ achieves the highest Kendall’s $\tau$ (0.95 ± 0.02), indicating strong and consistent alignment with the ground-truth ranking. In contrast, UD yields a much lower correlation (0.17 ± 0.33), suggesting poor reliability on this task. While EnsV provides a more competitive baseline (0.41 ± 0.20), its performance is still notably lower than that of $S_E$, indicating that our method’s direct comparison of embedding structures is more effective than relying on an ensemble’s prediction consensus when a significant task shift occurs. Among the four methods, $S_E$ also demonstrates the most stable performance, likely due to its embedding-level comparison, which better preserves structural information from the target data. By contrast, $S_L$ relies on a randomly initialized classifier to map RINN features to scalar outputs, making it more susceptible to instability from weight initialization and information compression.
UD’s poor performance merits further examination. While Kendall’s $\tau$ is naturally sensitive with only five models, the primary limitation lies in UD’s core assumption: that low model uncertainty on a pretext task (frame reconstruction) correlates with downstream task performance. This assumption fails in object tagging, leading UD to collapse score ranges and obscure meaningful distinctions between models. For instance, in the Rouen scene, UD misidentifies the best-performing model, underscoring how pretext-task uncertainty can misrepresent object-level difficulty. Interestingly, UD performs better in tasks where pretext and target objectives are more aligned (see
Section 4.2). EnsV is also affected by this fundamental task misalignment, as its clustering is based on the same pretext task embeddings. However, its core mechanism, the reliance on a “joint agreement” from the ensemble, is a double-edged sword. On one hand, it provides a stabilizing effect that filters out noise from poorly performing models, which explains its more competitive ranking compared to UD in scenes like St. Marc. On the other hand, this same property often leads to tied or nearly identical scores among the top-competing models, making the final ranking ambiguous and potentially suboptimal.
To further validate these observations,
Figure 6 provides a scene-level comparison of selected models. $S_E$ consistently identifies the top-performing model across all scenes. Although UD occasionally agrees with $S_E$, even small misorderings, such as those seen in Atrium, St. Marc, and Rouen, can drive Kendall’s $\tau$ to zero. These failures typically arise when UD assigns nearly identical scores to its top three models, making its ranking vulnerable to minor fluctuations. Meanwhile, $S_L$ and $S_E$ show strong agreement, diverging only in the second-best model for Rouen. These results emphasize the value of analyzing the score distribution itself, especially when competing models are closely matched, rather than relying solely on rank-based metrics.
The consistent performance of $S_E$ reinforces its independence from output uncertainty and its robustness under task misalignment. Finally, the Mixed model (visualized within its cell in Figure 6 as a grid of four images), which is trained on footage from all four sources, never secures first place in any scene, highlighting the practical benefit of maintaining a compact model zoo of camera-specific encoders rather than relying on a single “universal” model when computational resources allow.
4.2. Anomaly Detection
The models in the model zoo were originally trained for anomaly detection. We now evaluate how well our transferability assessment methods predict their performance on the same task across new environments. For each pretrained model, we apply a one-class SVM (OCSVM) to its extracted feature representations and compute the AUC score as the performance metric. Since $S_E$ is derived from Equation (3), we directly report $S_E$ values, which range from 0 to 1. The resulting Kendall’s $\tau$ values for Street Scene and NWPU Campus are summarized in
Table 2. Qualitative insights into the scene-level model selections are further illustrated in
Figure 7. Across most target scenes, all four methods, UD, EnsV, $S_L$, and $S_E$, achieve $\tau$ values above 0.7, indicating strong correlation with the actual model rankings despite the absence of source data or target labels. This confirms the effectiveness of our approach in identifying well-suited models for adaptation. Notably, $S_E$ consistently outperforms $S_L$ in ranking stability. As also observed in the object tagging task (
Section 4.1), this difference stems from the instability of the randomly initialized classifier in $S_L$, which often yields imbalanced predictions. In contrast, $S_E$ benefits from its embedding-level comparison via minibatch-CKA, offering more stable and reliable estimates. This finding is further corroborated by the ablation study in
Section 4.4. UD also demonstrates significantly stronger performance on anomaly detection compared to object tagging. In particular, it achieves perfect alignment ($\tau = 1$) on Street Scene, outperforming both $S_L$ and $S_E$. The strong performance of both baselines is expected, as they both leverage the alignment between the models’ original pretext task and the anomaly detection objective. UD’s success is a direct result of this, as it measures uncertainty on the relevant pretext task. EnsV’s competitive ranking is rooted in the same alignment, as the meaningful embeddings produce effective clusters; the final ensemble further provides additional stability by aggregating these predictions. However, Street Scene involves a single fixed viewpoint and may not generalize well to more complex surveillance settings. A more comprehensive evaluation is provided by NWPU Campus, which contains 43 distinct camera perspectives.
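For reference, a simplified version of the ground-truth protocol used in this subsection (OCSVM on frozen features, scored with frame-level AUC) could look as follows; the scikit-learn components and hyperparameters are our assumptions, not the paper's exact configuration.

```python
# Illustrative ground-truth protocol for anomaly detection: fit a one-class SVM
# on a model's frozen (normal-only) training features and score test frames by AUC.
from sklearn.svm import OneClassSVM
from sklearn.metrics import roc_auc_score

def anomaly_auc(train_feats, test_feats, test_labels):
    """test_labels: 1 = anomalous frame, 0 = normal frame."""
    ocsvm = OneClassSVM(kernel='rbf', nu=0.1).fit(train_feats)
    anomaly_score = -ocsvm.decision_function(test_feats)  # lower decision value = more anomalous
    return roc_auc_score(test_labels, anomaly_score)
```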
On NWPU Campus, both UD and EnsV maintain competitive performance, achieving a higher mean and lower variance compared to their object tagging results. It is worth mentioning that EnsV is slightly more stable than UD, owing to its ensembling of model predictions. Nonetheless, $S_E$ still secures the highest overall mean $\tau$, with lower sensitivity to scene variability. In contrast, $S_L$ remains unstable, exhibiting high variance across scenes, further emphasizing the limitations of classifier-based methods in unsupervised settings and the robustness of embedding-based comparisons.
To further visualize these trends,
Figure 7 presents scene-level model selections across NWPU Campus and Street Scene.
Figure 7a shows example frames from the target scenes, while
Figure 7b–f illustrate the top two models identified by each method. UD and EnsV generally align with the ground truth, deviating in NWPU-D01, where they misorder the second-best model, likely due to minimal differences in the scores. Interestingly, $S_E$ also agrees with UD and EnsV in this case, suggesting some ambiguity in the ground-truth ranking. By contrast, $S_L$ exhibits greater instability, with inconsistent top-three rankings, underscoring its sensitivity to classifier initialization and output imbalance.
4.3. Event Classification
We conclude our per-task evaluation with event classification, where each anomaly instance is assigned a semantic label, as defined in [
18]. As shown in
Table 2, the performance of both UD and EnsV drops considerably compared to their results on anomaly detection, particularly on the Street Scene dataset. In contrast, both $S_L$ and $S_E$ maintain stable performance across tasks, with $S_E$ consistently achieving the highest Kendall’s $\tau$ overall.
This discrepancy is especially notable because all four transferability metrics, namely UD, EnsV, $S_L$, and $S_E$, are computed without access to target labels and are therefore invariant to task changes. The only component that changes across tasks is the oracle baseline (FCN), which is retrained using event-level annotations instead of anomaly labels. Consequently, the ground-truth ranking of model performance shifts, even though the underlying target data remain the same.
To explore the implications of this task shift,
Figure 8 highlights three representative scenes where the top-ranked model by FCN changes between anomaly detection and event classification. Notably, both $S_L$ and $S_E$ maintain consistent selections, showing resilience to the shift in downstream objectives. In contrast, both UD and EnsV are more sensitive to these shifts. For example, in the first row (Street), UD ranks the second-best model third, similar to $S_L$ and $S_E$, owing to only minor differences in transferability scores. However, in the second (D01) and third (D48) rows, UD misranks the second-best model as fifth, indicating a sharper deviation from the ground truth. EnsV, on the other hand, similar to object tagging, generates tied ranks when the task drifts. These cases underscore the vulnerability of UD and EnsV when their pretext assumption, uncertainty as a proxy for downstream performance, breaks under task misalignment, while embedding-based methods like $S_E$ remain more robust.
Together, these results illustrate that task changes, even without altering the input data, can lead to significant divergence in model rankings. Methods like $S_E$ that rely on structural representation comparison rather than output uncertainty or classifier initialization offer more consistent transferability assessments under such shifts, making them particularly suitable for task-agnostic deployment in smart surveillance systems.
4.4. Ablation Study
While increasing the number of RINNs can improve the reliability of transferability assessments, it also raises computational costs. Therefore, it is essential to strike a balance between assessment stability and computational efficiency. To analyze this trade-off, we randomly initialized 100 RINNs with identical architectures and applied them to the Urban Tracker dataset to compute both $S_L$ and $S_E$. We then evaluated subsets of the top $m$ RINNs, for increasing values of $m$, and measured the stability of their scores using two statistical metrics: the interquartile range (IQR) and the median absolute deviation (MAD). IQR quantifies the spread between the first and third quartiles, while MAD measures the median of the absolute deviations from the median. Both are less sensitive to outliers than the standard deviation, making them suitable for assessing stability in the presence of noisy or highly variable scores. Lower values of either metric indicate greater stability across RINNs. The results, shown in
Figure 9, illustrate how stability varies with the number of RINNs, helping to identify an optimal configuration that balances reliability and efficiency.
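Both stability metrics are straightforward to compute; a short NumPy example, with illustrative score values, is given below.

```python
# IQR and MAD, the two stability metrics used in the ablation study.
import numpy as np

def iqr(scores):
    q1, q3 = np.percentile(scores, [25, 75])
    return q3 - q1

def mad(scores):
    med = np.median(scores)
    return np.median(np.abs(scores - med))

# e.g., stability of S_E over subsets of m RINNs (scores are illustrative data)
scores = np.array([0.71, 0.69, 0.73, 0.70, 0.72])
print(iqr(scores), mad(scores))
```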
For both $S_L$ and $S_E$, we observe that beyond a certain number of RINNs, increasing the ensemble size yields diminishing returns in performance stability. The results indicate that using 20 RINNs provides a good trade-off between reliability and computational efficiency. Additionally, the stability curve for $S_E$ demonstrates that the embedding-level assessment is consistently more robust than its label-level counterpart, reinforcing its suitability for resource-constrained deployment.
5. Conclusions and Future Work
We address the challenge of identifying the most adaptable pretrained model for source-free, label-free domain adaptation in smart surveillance, where neither the original training data nor target annotations are accessible. Universal models often fall short of the performance achieved by camera-specific models, and only a limited number of transferability assessment methods exist. Among them, approaches like uncertainty distance rely on the questionable assumption that a model’s uncertainty on one task reliably predicts its performance on another. In the absence of labeled target data, effective transferability assessment requires a task-agnostic reference embedding space; this motivates our use of ensembles of randomly initialized neural networks (RINNs) to avoid the bias introduced by pretrained representations. We propose a novel and effective framework for transferability estimation in source-free unsupervised settings, specifically tailored to the unique demands of smart surveillance systems, which are characterized by heterogeneous camera configurations and pronounced domain shifts. By leveraging RINNs as unbiased feature extractors, our approach mitigates both structural and task-specific biases inherent in conventional pretrained models. This enables task-agnostic, model-independent transferability assessment. Additionally, our embedding-level metric reduces the computational overhead associated with pseudo label-based approaches, making the method scalable for large surveillance deployments. Empirical evaluations on real-world surveillance datasets demonstrate the practical utility of our method. Specifically, our embedding-level score $S_E$ achieved strong Kendall’s $\tau$ correlations with ground-truth model rankings across multiple downstream tasks, generally performing comparably to or outperforming other source-free assessment methods. For instance, $S_E$ obtained $\tau$ values of 0.95 ± 0.02 for object tagging on Urban Tracker, 0.94 ± 0.13 for anomaly detection on NWPU Campus, and 0.89 ± 0.09 for event classification on Street Scene. This consistent performance indicates that our framework can reliably identify the most adaptable pretrained models, even when the specific downstream task is unknown and both source data and labeled target data are unavailable. These quantitative findings highlight the value of our approach in challenging real-world source-free unsupervised scenarios, particularly in privacy-sensitive and resource-constrained smart city environments.
5.1. Limitations
Despite the positive evaluation from the conducted experiments, this work has several limitations that need to be discussed.
First, a primary limitation arises from the symmetry of minibatch-CKA. We proposed that a large ensemble of RINNs approximates a comprehensive, information-rich embedding space for the target data. The ideal pretrained model, therefore, is one whose representations cover a large subset of this embedding space, and our goal is to measure the extent of this informational overlap. However, minibatch-CKA is symmetric: it indicates the degree of overall alignment between the two embedding spaces rather than measuring this one-way, subset relationship. This is compounded by potential redundancy within the RINN ensemble. Since our method does not explicitly decorrelate the features from different RINNs, it risks overweighting certain information and impairing the model ranking.
Second, our experimental validation is limited in scope. The evaluated tasks are classification-related, and the method’s efficacy for other surveillance applications, such as anomaly event localization, semantic segmentation, and crowd counting, remains unverified. Furthermore, the experiments were conducted on static datasets, and the analysis does not account for temporal dynamics or concept drift, where data distributions evolve over time.
Third, the study is bounded by the limitations and biases of the datasets used. The source datasets are mostly recorded during the daytime. Geographically, these datasets are biased toward specific urban and residential environments, which presents a realistic and challenging transferability scenario when assessing models on target data from different global contexts. The target datasets also possess unique constraints: Street Scene [
17] is limited by a single, fixed viewpoint and short, discontinuous clips; NWPU Campus [
18] has a known data imbalance and limited object interaction; and Urban Tracker [
19] contains non-standard annotations and variable technical quality. Despite these limitations, by intentionally evaluating our method across this diverse and complementary set of target domains, we have subjected our approach to a more holistic and stringent test of its robustness. While each individual experiment is bounded by the nature of its data, the overall experiments provide broader evidence of our method’s applicability across varied conditions.
Finally, practical and computational factors present further considerations. While the method avoids the cost of full fine-tuning, it incurs its own practical costs. The need to generate and store RINN weights creates a non-trivial memory footprint and computational overhead. This paper does not include a formal cost–benefit analysis comparing this assessment overhead to other lightweight baseline strategies.
Acknowledging these boundaries is crucial for contextualizing our findings and provides a clear roadmap for future research.
5.2. Future Work
In future work, we will focus on several key aspects: methodological refinement, expanded empirical validation, and deeper theoretical analysis.
First, we will refine our core assessment method. This involves exploring asymmetric embedding similarity metrics to better estimate directional information transfer between representations. We will also investigate the impact of different RINN architectures and initialization strategies to construct the ensemble more effectively. Although RINNs require no training, they possess a non-trivial memory footprint. While the current assessment was conducted on resource-rich cloud infrastructure, developing a more memory-efficient implementation is crucial for large-scale deployment. Furthermore, our current approach is a one-step process for assessing model transferability; to further adapt the selected model to target data, more efficient model adaptation strategies must be investigated.
Second, we will expand the scope of our experimental validation. This includes applying the method to non-classification tasks (e.g., semantic segmentation), extending the analysis to streaming data to assess performance against temporal drift (e.g., changes in lighting and weather), and evaluating against target datasets from more diverse geographical and temporal contexts to ensure robustness. This expansion also includes a plan to expand the diversity of the model zoo. The current study utilized source models from the video surveillance domain; a comprehensive investigation using models from disparate domains (e.g., general image or video datasets such as ImageNet or Kinetics) will be necessary to rigorously test the framework’s generalizability.
Third, we will conduct a comparative analysis of the use of pretrained embeddings versus RINN embeddings. Our method avoids reliance on pretrained embeddings; this design is motivated by observations in prior work and our own empirical findings suggesting that such embeddings do not generalize reliably when the downstream task differs from the pretext task. This issue is particularly relevant in smart surveillance scenarios, where source models are often trained for general purposes, while the downstream task (e.g., anomaly detection, event classification, or object tagging) is not fixed at deployment. While our approach demonstrates greater consistency, it should be understood as a robust alternative to, rather than a direct solution for, the poor generalization of pretrained embedding-based methods under task drift. Future work could conduct a theoretical analysis of how pretrained embedding-based methods behave under task uncertainty, and investigate whether RINN embedding-based assessment can exploit known task alignment, when available, to provide more stable guidance.