1. Introduction
Anomaly detection is a fundamental task in machine learning and computer vision, with broad applications in medical diagnosis, video surveillance, network security, remote sensing, and financial risk monitoring [
1]. In these scenarios, the goal is to identify rare or abnormal patterns that deviate from expected normal behavior. Among these applications, industrial anomaly detection plays an important role in modern manufacturing and quality inspection, where reliable defect detection is required under diverse deployment conditions [
2,
3]. Traditional unsupervised anomaly detection methods usually rely on a set of normal training images to learn category-specific normal patterns. Although these methods have achieved strong performance on standard benchmarks, their dependence on target-domain normal data limits their applicability to newly introduced products. In practical industrial scenarios, new categories may need to be inspected immediately, while collecting and annotating sufficient normal samples for each product is often expensive or impractical. This motivates zero-shot anomaly detection (ZSAD), which aims to detect anomalies in unseen categories without using target-domain training samples [
4].
With the strong generalization ability of vision–language models, CLIP-based ZSAD has become an active research direction. Recent surveys also highlight that Transformers and foundation models are reshaping visual anomaly detection by improving contextual modeling, transferability, and zero-/few-shot generalization [
5,
6]. Most existing methods compare test images with predefined or learned normal/abnormal textual prompts, or improve visual–text alignment through prompt learning, anomaly-aware semantic modeling, context adaptation, and local feature enhancement [
7,
8,
9,
10,
11,
12,
13]. These methods have achieved promising zero-shot anomaly classification and localization performance. However, they usually process each test image independently and therefore underuse the relationships among test images in real industrial inspection streams.
To exploit such test-time relationships, recent studies have explored batch and online zero-shot anomaly detection. Batch methods such as MuSc [
14] use mutual scoring among multiple unlabeled test images, showing that inter-image relationships can provide useful anomaly cues. However, batch-level inference usually requires access to multiple test images at the same time and may introduce additional computational cost. More recently, RareCLIP [
15] introduced an online zero-shot anomaly detection paradigm, where test images are processed sequentially and the memory bank evolves during inference. By combining CLIP with dynamic rarity modeling, RareCLIP exploits the observation that normal patches tend to appear consistently over time, whereas abnormal patches are rare and variable. It therefore maintains online patch memories to estimate rarity during sequential inference.
Despite this progress, RareCLIP-style online memory evolution raises a new reliability issue. Unlike conventional memory-bank methods, where memory is constructed from clean normal training samples, the online memory in RareCLIP is updated from unlabeled test streams. Therefore, the evidence used for memory updating may involve normal regions, defective regions, ambiguous visual patterns, or category-specific appearance variations. Once unreliable evidence is incorporated into the memory, it can affect subsequent predictions and lead to error accumulation. This issue is particularly important for weak or visually unstable categories, where anomaly evidence is subtle and the original update rule may not always provide reliable memory evolution. While existing online or test-time methods mainly focus on nominal model adaptation, multi-image inference, or anomaly scoring, we focus on the reliability control of the memory evolution process itself.
Motivated by this observation, we propose TSMR, a TPB-guided Selective Memory Refinement strategy for online zero-shot anomaly detection. TSMR is designed as a lightweight extension of RareCLIP. Instead of modifying the CLIP backbone or redesigning the anomaly scoring pipeline, it focuses on regulating the online memory update process. Specifically, TSMR combines confidence-based self-calibration, text-prior-based reliability assessment, and weak-class selective activation to derive a frame-level memory-update decision during sequential inference. In this way, TSMR preserves the efficiency and transferability of RareCLIP while improving the reliability of online memory evolution for difficult categories.
Our main contributions are summarized as follows:
We identify the reliability of online memory updating as a key issue in RareCLIP-style online zero-shot anomaly detection. Since online memory is updated from unlabeled test streams rather than clean normal training data, unreliable or ambiguous update evidence may affect subsequent predictions, especially for weak or visually unstable categories.
We propose TSMR, a class-sensitive text-prior-based (TPB) selective memory refinement strategy for online memory evolution. Different from uniform online updating, TSMR integrates confidence-based self-calibration, text-prior-based reliability assessment, and weak-class selective activation to derive a frame-level memory-update decision during sequential inference, without modifying the backbone encoder or the anomaly scoring pipeline.
We provide object-wise and seed-wise analyses on VisA and MVTec AD. The results show that TSMR improves the reproduced RareCLIP baseline on VisA, with gains mainly concentrated on selected weak categories, while maintaining competitive performance on MVTec AD.
The remainder of this paper is organized as follows.
Section 2 reviews related work on CLIP-based zero-shot anomaly detection and online memory adaptation.
Section 3 presents the proposed TSMR framework.
Section 4 describes the experimental settings, quantitative and qualitative results, ablation studies, and cost analysis.
Section 5 concludes the paper and discusses future directions.
3. Method
3.1. Overview
Here, we study online zero-shot anomaly detection for industrial images, where no target-domain training samples are used during downstream deployment. Given a test image , the objective is to predict both a pixel-level anomaly map and an image-level anomaly score , while allowing the model to refine its memory state sequentially during inference.
We propose TSMR, a TPB-guided Selective Memory Refinement framework built upon the reproduced RareCLIP baseline. As illustrated in
Figure 1, TSMR keeps the original visual–text encoder and anomaly scoring pipeline unchanged, and instead improves the reliability of online memory evolution through a selective update strategy.
We explicitly distinguish the prediction outputs from the memory-update output. For the current test image , the reproduced RareCLIP prediction pipeline uses the current memory state to produce the pixel-level anomaly map and the image-level anomaly score . The memory-assisted matching module provides reference-matching and rarity-related evidence for anomaly map construction and image-level aggregation, while the text-prior branch provides complementary semantic anomaly cues. After the prediction for is obtained, TSMR regulates the transition from to . Therefore, the selective memory refinement module is not the anomaly prediction output for the current image; instead, it updates the online memory and affects subsequent predictions through the refined memory state.
Specifically, TSMR consists of four coupled components: (i) the reproduced RareCLIP anomaly reasoning pipeline, which extracts patch-level anomaly evidence and produces image-level and pixel-level predictions; (ii) a confidence quantile gate, which estimates the frame-level reliability of the current update; (iii) a text-prior-based (TPB) reliability check, which uses textual anomaly evidence to avoid unreliable memory refinement; and (iv) a weak-class selective activation mechanism, which enables the proposed refinement only for selected difficult categories to avoid unnecessary interference on already stable classes.
Different from retraining-based adaptation methods, TSMR does not introduce a new optimization stage or alter the backbone architecture. Instead, it acts as a lightweight plug-in refinement rule at test-time. In the online setting, patch-level confidence and TPB evidence are used to derive a frame-level memory-update decision. When the update condition is satisfied for an activated weak category, the online memory is refined from
to
for subsequent inference. In this way, TSMR improves the stability of test-time updating while preserving the efficiency and transferability of the original zero-shot framework. At inference time, anomaly maps and image-level scores are still computed by the original RareCLIP scoring pipeline, whereas the proposed refinement affects prediction indirectly through a more reliable memory state. The main symbols and their definitions are summarized in
Table 1.
3.2. Baseline Online Memory Update
Let
denote the patch-level visual features extracted from the current test image, where
N is the number of patches. In the baseline RareCLIP framework, the online memory is updated sequentially during inference. Denoting the memory state before processing the current image by
, the baseline update can be abstractly written as
where
denotes the original RareCLIP memory-update operator.
This update enables the memory to evolve with the incoming test stream and improves the adaptability of online zero-shot anomaly detection. However, the baseline update is applied to each incoming frame in a relatively uniform manner. Since the test stream is unlabeled, the current frame may contain normal regions, defective regions, ambiguous visual evidence, or category-specific appearance variations. If unreliable evidence is repeatedly incorporated into the memory, it may affect subsequent predictions. This issue is more evident for weak or visually unstable categories, where the reliability of online memory evolution is harder to estimate.
Based on this observation, our goal is not to redesign the anomaly scoring pipeline, but to regulate whether the original memory-update operator should be applied to the current frame. In other words, TSMR introduces a frame-level reliability gate before the baseline memory update, while keeping the backbone encoder and the RareCLIP scoring functions unchanged.
3.3. TPB-Guided Selective Memory Refinement
Figure 2 illustrates the core idea of the proposed frame-level selective memory refinement mechanism. Given the current online memory state
and the incoming test frame
, TSMR estimates whether the current frame is reliable enough for memory refinement. Specifically, patch-level matching confidence and text-prior-based evidence are first computed from
and then aggregated into frame-level gates. These gates produce the final update decision
, which determines whether the original RareCLIP memory-update operator should be applied to the current frame. If the gate is satisfied, the memory is updated from
to
using the original update operator; otherwise, the memory state remains unchanged. Therefore, TSMR regulates the frame-level transition of the online memory, rather than sparsely writing individual selected patches into memory.
3.3.1. Reference Confidence Quantile Gate
We first use the memory-assisted matching confidence to estimate whether the current frame is reliable for online memory refinement. Let
denote the reference confidence score of patch
, which is obtained from the memory-assisted matching branch. We compute the frame-level mean reference confidence as
To avoid using a fixed confidence threshold across different online streams, we maintain a history set
of previous frame-level confidence values. Let
denote the quantile parameter and
denote the minimum number of historical frames required to activate this gate. The reference confidence quantile gate is defined as
where
is the indicator function. When the history is not sufficiently long, this gate is not used to block memory updating. Otherwise, the current frame is allowed to pass the confidence gate only when its mean reference confidence is not lower than the historical quantile threshold. After the gate is evaluated for the current frame,
is appended to
for subsequent online calibration. This design provides a self-calibrated frame-level confidence check for online memory refinement.
3.3.2. Text-Prior-Based Frame Consensus Gate
Confidence alone is not sufficient, because a frame may still contain suspicious regions even when the visual features can be confidently matched to the memory. Therefore, we further introduce a text-prior-based (TPB) frame consensus gate. This gate combines memory-assisted reference evidence and text-prior anomaly evidence to judge whether the current frame is suitable for memory refinement.
Let denote the reference consistency score of patch i from the memory-assisted matching branch, and let denote the text-prior abnormal probability of patch i. In our implementation, and are both derived from the patch-level reference confidence scores produced by the memory-assisted matching branch. We use to compute the frame-level mean confidence for quantile calibration, and to denote the patch-level reference consistency evidence used in the TPB frame consensus gate.
A patch is regarded as reliable for memory refinement when it has sufficiently high reference consistency and sufficiently low text-prior abnormal probability:
where
and
are the reference consistency threshold and the text-prior abnormality threshold, respectively. We then compute the frame-level reliable ratio as
The TPB frame consensus gate is defined as
where
is the frame-level fraction threshold. In our implementation, we set
. Therefore, the current frame passes the TPB gate only when at least half of its patches satisfy both the reference consistency condition
and the text-prior normality condition
. If this ratio is lower than
, the current frame is regarded as unreliable for memory refinement and is not used to update the online memory for activated weak categories. The values of
,
, and
are fixed before the final online evaluation, and the selection procedure is described in the implementation details.
3.3.3. Weak-Class Selective Activation
A key observation from our pilot experiments is that not all categories benefit equally from additional memory filtering. For many categories, the baseline RareCLIP update is already sufficiently stable, and extra gating may bring little benefit or even introduce unnecessary interference. Therefore, instead of applying the proposed refinement to all categories uniformly, we activate it only for a selected weak-class subset.
Let
y denote the category identity of the current test sample, and let
denote the activated weak-class set. The final frame-level patch-memory update gate is defined as
For categories outside the weak-class set, , and TSMR reduces to the original RareCLIP updating behavior. For activated weak categories, memory refinement is performed only when both the reference confidence quantile gate and the TPB frame consensus gate are satisfied.
The resulting online memory update is written as
Thus, TSMR controls whether the original RareCLIP memory-update operator is applied to the current frame. When the gate is activated, the current frame updates the online memory using the original update operator; otherwise, the memory state remains unchanged for subsequent inference. This frame-level gated refinement is consistent with the memory layout of the reproduced RareCLIP implementation and avoids the misleading interpretation that individual selected patches are sparsely written into memory.
3.4. Inference and Scoring
TSMR does not modify the backbone encoder, the anomaly map construction function , or the image-level scoring function of RareCLIP. For each incoming test image, prediction is performed using the current memory state before memory updating. Specifically, the current memory and visual features are first used by the reproduced RareCLIP pipeline to generate the pixel-level anomaly map and the image-level anomaly score. After the prediction is completed, the proposed selective update rule regulates whether the memory should be refined to for subsequent inference.
Formally, given the current memory state
and the patch-level visual features
extracted from the input image, the baseline RareCLIP inference pipeline produces a pixel-level anomaly map
where
denotes the baseline anomaly map construction function. Based on the resulting anomaly map, the image-level anomaly score is computed as
where
denotes the baseline image-level aggregation function.
After the prediction for the current image is obtained, TSMR selectively updates the memory according to the gated update rule, producing for subsequent inference. In the offline setting, memory updating is disabled and the evaluation is deterministic. In the online setting, the proposed selective refinement affects future predictions through the updated memory state. In this way, TSMR remains a lightweight plug-in strategy that preserves the original RareCLIP inference pipeline while improving the reliability of online memory evolution on selected weak categories.
4. Experiments
4.1. Datasets
VisA [
33] is a challenging industrial anomaly detection benchmark that has been widely adopted for evaluating zero-shot and test-time adaptive anomaly detection methods. It contains 12 product categories with both image-level anomaly labels and pixel-level ground-truth masks for anomalous samples. Compared with more canonical industrial benchmarks, VisA exhibits larger intra-class appearance variations, more cluttered backgrounds, and more diverse anomaly patterns. These characteristics make it a suitable testbed for evaluating whether a method can maintain robust anomaly discrimination and localization performance under visually complex conditions.
MVTec AD [
34] is one of the most established benchmarks for industrial anomaly detection. It contains 15 categories, including 10 object categories and 5 texture categories, with image-level labels and pixel-level ground-truth masks for anomalous samples. The anomaly types in MVTec AD cover a variety of industrial defects, such as scratches, cracks, contamination, holes, and structural irregularities. Owing to its relatively standardized acquisition setup and wide adoption in the literature, MVTec AD serves as a strong reference benchmark for evaluating both anomaly classification and fine-grained localization quality.
In this work, both datasets are used only for evaluation under the zero-shot setting. No target-domain training images are used to train or fine-tune the backbone model or the proposed TSMR module. For each dataset, we follow the official category definitions and use the image-level labels and pixel-level ground-truth masks for computing I-AUC, P-AUC, and PRO.
For online evaluation, each object category is treated as an independent sequential test stream. The online memory is initialized before processing a category, updated sequentially as test images arrive, and reset before evaluating the next category. Thus, the memory evolution is category-specific and does not mix information across different object categories. Within each category, different random seeds correspond to different test-image orders, which are used to evaluate the stability of online memory updating.
VisA is used for the main analysis of the proposed TSMR refinement, including object-wise comparison, seed-wise stability analysis, weak-class activation, and hyperparameter sensitivity. MVTec AD is used as an additional widely adopted benchmark to examine whether the proposed memory refinement remains competitive under the same zero-shot and online evaluation protocol. This usage reflects the goal of our method: improving the reliability of RareCLIP-style online memory evolution, especially on visually challenging categories, while preserving the performance and transferability of the original zero-shot framework.
4.2. Evaluation Metrics
Image-level AUROC (I-AUC) [
35] measures the ability of a method to distinguish normal images from anomalous images according to the predicted image-level anomaly scores. As a threshold-independent metric, it is widely used to assess image-level anomaly classification performance.
Pixel-level AUROC (P-AUC) [
34,
35] evaluates anomaly localization quality by comparing the predicted anomaly map with the ground-truth pixel-level annotation. It reflects how well the predicted anomaly responses separate anomalous pixels from normal pixels over all possible thresholds.
Per-Region Overlap (PRO) [
34] measures the localization quality of anomaly maps from a region-aware perspective. Compared with pixel-level AUROC, PRO places more emphasis on whether anomalous regions are correctly covered as coherent defect regions, and is therefore widely used to evaluate the practical usefulness of anomaly localization results in industrial inspection scenarios.
Higher values indicate better performance for all metrics.
4.3. Implementation Details
Our model is implemented in PyTorch 1.13.1 with Python 3.10 and torchvision 0.14.1, using CUDA 11.6 on top of the reproduced RareCLIP framework. Since TSMR is designed as a lightweight test-time refinement strategy, we do not introduce an additional training stage beyond the pretrained baseline model. All comparisons are conducted under the same reproduced backbone, preprocessing pipeline, and evaluation protocol to ensure fairness. All experiments are conducted on an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
We report both offline and online results. In the offline setting, test-time memory updating is disabled and the evaluation is deterministic. In the online setting, memory is updated sequentially during inference, and the results may depend on the order of incoming test samples. Therefore, we use five seeds, i.e., seeds 0–4, to evaluate the stability of online memory updating under different test orders. This five-seed setting follows the reproduced RareCLIP online evaluation protocol and provides a practical estimate of order-induced variation without changing the pretrained model or introducing additional training. For each seed, the random seed is fixed, and the test images are shuffled within each object category using the same seed. The categories are evaluated one by one in a fixed category order, and the online memory bank is reset before evaluating each category.
For the object-wise analysis, we evaluate each object category separately rather than pooling all categories together. For each category, I-AUC is computed using image-level normal/anomalous labels, while P-AUC and PRO are computed using pixel-level ground-truth masks for anomalous regions. The object-wise results are obtained by averaging the corresponding category-level metrics over the five seeds. The mean row is the unweighted average over all categories. For the seed-wise analysis, each seed corresponds to one complete multi-category online evaluation run. For each seed, we first compute the category-wise metrics and then report the macro-average over all categories. The mean and standard deviation are computed across the five seeds.
For the proposed TSMR gates, we set for the reference confidence quantile gate and use as the minimum history length before activating this gate. The confidence quantile parameter is analyzed separately in the sensitivity study. For the TPB frame consensus gate, three thresholds are involved: for the reference consistency condition, for the text-prior abnormality condition, and for the frame-level reliable ratio condition. We determine these thresholds using a reproducible development protocol on VisA with online seed 0. Specifically, is fixed to as a moderate reference consistency threshold on the normalized patch-level matching scores. We then perform a one-dimensional search for with , and for with , using macro-average I-AUC on VisA as the selection criterion. The selected setting is , which is fixed for all five-seed online evaluations reported in the main tables. Here, means that a frame passes the TPB gate only when at least half of its patches satisfy both the reference consistency condition and the text-prior normality condition. No per-seed tuning is performed in the reported results.
The same TPB thresholds are transferred to MVTec AD without additional threshold tuning. The activated weak-class set is determined separately for each dataset before the final five-seed evaluation; specifically, we use for VisA and for MVTec AD. For categories outside the activated weak-class set, TSMR follows the original RareCLIP memory-updating behavior. For fair comparison, under the same seed, the reproduced RareCLIP baseline and TSMR use the same pretrained backbone, preprocessing pipeline, per-category test order, and metric implementation. The only difference is the online memory-update rule.
4.4. Main Results
We first compare the proposed method with representative zero-shot anomaly detection methods on the VisA and MVTec AD benchmarks, as summarized in
Table 2. Since our method is built upon RareCLIP and mainly modifies the test-time memory refinement strategy, we group the compared methods by testing paradigm and report both datasets in a unified manner. It should be noted that MuSc* and TSMR follow different test-time inference paradigms. MuSc* leverages mutual scoring across multiple unlabeled test images and therefore belongs to a batch or multi-image inference setting. In contrast, TSMR follows the causal online protocol of RareCLIP, where test images are processed sequentially and each prediction uses only the current image and the memory accumulated from previous samples. Therefore, the comparison with MuSc* is included as a reference under different test-time information assumptions, rather than as a strictly identical online baseline comparison. In the offline setting, memory updating is disabled; therefore, the offline results mainly reflect the reproduced zero-shot scoring pipeline rather than the online memory-refinement effect of TSMR. Under the online protocol, where memory evolution is enabled, TSMR improves the reproduced RareCLIP baseline on VisA from 94.4% to 95.1% in I-AUC, from 98.8% to 98.9% in P-AUC, and from 93.5% to 94.0% in PRO. On MVTec AD, TSMR maintains the same I-AUC as the reproduced RareCLIP baseline and slightly improves P-AUC and PRO. These results indicate that the proposed selective memory refinement is most beneficial for visually challenging VisA categories, while preserving competitive performance on MVTec AD under the same online protocol.
To further remove possible confounding factors caused by implementation details, we conduct a fair in-house comparison under the same reproduced RareCLIP framework.
Table 3 presents the object-wise comparison between the reproduced RareCLIP baseline and the proposed method on VisA under the online setting. Each entry is the average over five seeds for the corresponding object category, and the mean row is the unweighted average over all categories. Compared with the reproduced baseline, our method achieves a better overall trade-off across the three core metrics while keeping the same backbone, training protocol, and evaluation pipeline. Although the average gains are naturally limited by the already strong baseline, the object-wise results reveal a clearer and more informative improvement pattern than the mean values alone.
More specifically, the benefits of the proposed method are concentrated on a subset of challenging categories rather than being uniformly distributed over the entire dataset. The most noticeable improvement is observed on macaroni2, and a clear gain is also obtained on the activated weak category capsules. This trend is consistent with our motivation that TPB-guided selective memory refinement is particularly helpful for categories whose anomaly evidence is relatively subtle or unstable under the original online update mechanism. At the same time, slight negative transfer can still be observed on a few categories, indicating that the proposed refinement is not universally beneficial to every object category.
To examine whether the observed gains are stable across different online test orders, we further report the seed-wise comparison in
Table 4. Each seed corresponds to a different within-category test order, and the reported value is the macro-average over all object categories under that seed. Following the reproduced RareCLIP online protocol, we use five seeds to estimate both the mean performance and the standard deviation caused by order sensitivity in online memory updating. The proposed method consistently outperforms the reproduced RareCLIP baseline across all five seeds on the three mean metrics, showing that the improvement is not caused by a particular test order or a single favorable run. This result is important because the online setting is inherently order-sensitive, and the seed-wise comparison confirms that the advantage of the proposed method remains stable under different evaluation orders.
Overall,
Table 2,
Table 3 and
Table 4 support two main conclusions. Firstly, the proposed method consistently improves the reproduced RareCLIP baseline on VisA in a fair setting, while remaining competitive on MVTec AD. Secondly, the improvement mainly comes from more effective refinement on selected weak or difficult categories, rather than from a uniform gain across all classes. This observation further motivates the TSMR design analyzed in the following ablation study.
4.5. Ablation Study
We next analyze two key questions behind the final TSMR configuration: which weak-class subset should be activated, and how sensitive the method is to the patch gate quantile. The corresponding results are reported in
Table 5 and
Table 6.
4.5.1. Effect of Weak-Class Subset Selection
To determine the activated weak-class set, we first perform pilot screening with single-class weak-only activation on VisA. The screening is used to identify candidate categories for ablation rather than to exhaustively search all possible subsets. Based on the pilot results and object-wise behavior, capsules and macaroni2 are selected as the main weak-class candidates, while pcb1 is included as an additional difficult candidate to examine whether enlarging the activated set further improves performance.
Table 5 studies the effect of activating different candidate weak-class subsets. The results show that a broader activation set does not necessarily lead to better performance. In particular, activating all three candidate weak classes in A1 does not improve the final mean performance and is slightly inferior to the final configuration. By contrast, activating only capsules already yields a competitive result, while further including macaroni2 gives the best overall trade-off and is therefore adopted as the final setting.
It is worth noting that the margin between the tested subsets is not large. Therefore, the role of weak-class subset selection should not be interpreted as producing a dramatic mean-level improvement by itself. Instead, its main value lies in avoiding slight negative transfer from unsuitable classes while preserving the gains on more responsive ones. This observation is consistent with the object-wise comparison in
Table 3, where the main benefits of the proposed method are concentrated on categories such as capsules and especially macaroni2. In contrast, further including pcb1 does not provide additional benefit at the final mean level. We also tested replacing pcb1 with pcb2 in the three-class activation setting, obtaining 95.11 ± 0.44 I-AUC, 98.92 ± 0.01 P-AUC, and 93.96 ± 0.16 PRO. This result is close to A3 and does not provide a clear advantage over the more selective two-class configuration. Overall, these results suggest that selective refinement should be applied to a suitable subset of weak classes rather than uniformly expanded to all candidate difficult categories.
4.5.2. Sensitivity to the Patch Gate Quantile
We further evaluate the sensitivity of the patch gate quantile in
Table 6. The results obtained with different quantiles are very close across all three metrics, indicating that the performance of TSMR is not critically dependent on delicate tuning of this parameter. We use 0.10 as the default setting because it gives the best image-level AUROC in the main evaluation, while the other tested values remain within a very similar performance range.
This insensitivity is useful for two reasons. Firstly, it suggests that the observed performance is not driven by delicate tuning of the patch gate quantile. Secondly, it indicates that after the weak-class subset is properly selected, the gating behavior remains stable within the tested quantile range. In other words, the proposed method does not rely on a narrow operating point of this hyperparameter under the final configuration.
4.5.3. Memory and Time Cost
Finally, we analyze the computational cost of the proposed online memory setting by varying
, which controls the maximum number of stored memory items. As reported in
Table 7, increasing
leads to higher GPU memory consumption, while the inference time remains nearly unchanged within the evaluated range. Specifically, the GPU memory increases from 2579 MB to 6117 MB when
grows from 50 to 1000, whereas the computation time stays at around 64 ms.
These results suggest that, under our current implementation and RTX 3090 hardware setting, increasing the online memory size mainly affects GPU memory usage within the evaluated range, while introducing only a negligible change in average inference time. Therefore, GPU memory is the more direct constraint when using a larger online memory in our experiments. Nevertheless, computation time remains critical for real-time industrial inspection, and our conclusion should not be interpreted as memory size having no effect on latency under unlimited scaling. In practical deployment, the online memory size should be selected by jointly considering the available GPU memory and the latency requirement of the target application.
4.6. Qualitative Results
We further provide qualitative comparisons to examine the anomaly map behavior of TSMR. As shown in
Figure 3, the visualization is divided into two labeled parts: activated weak categories in
Figure 3a and non-activated categories in
Figure 3b. Each row presents the input image, ground-truth mask, RareCLIP anomaly map, and TSMR anomaly map. For activated weak categories, including macaroni2, capsules, and metal_nut, TSMR produces anomaly maps comparable to RareCLIP while suppressing scattered or diffuse responses in selected cases. This is especially visible for capsules, where RareCLIP tends to activate on visually salient normal regions, whereas TSMR gives more selective responses around defective areas. These results are consistent with the goal of TSMR: regulating online memory evolution rather than redesigning the anomaly map generation pipeline.
For the non-activated categories shown in
Figure 3b, such as candle, pipe_fryum, grid, and bottle, TSMR largely preserves the behavior of the reproduced RareCLIP baseline. This is expected because these categories are not included in the activated weak-class set, and the proposed selective refinement is not intended to force changes on already stable categories. Therefore, the qualitative results illustrate two aspects of TSMR: it can reduce noisy responses in selected weak-category cases, while maintaining comparable localization behavior on non-activated categories. These observations are consistent with the quantitative results in
Table 2,
Table 3, and
Table 5, where the gains mainly come from selected weak or difficult categories rather than from a uniform improvement across all classes.
Nevertheless, some limitations remain. Since TSMR keeps the RareCLIP anomaly scoring pipeline unchanged, its visual improvements may be moderate in some samples. For categories with extremely subtle defects, strong appearance variation, or ambiguous boundaries, the predicted responses may still be incomplete or may contain residual false activations. TSMR improves the reliability of test-time memory refinement, precise region coverage and boundary delineation remain challenging in difficult industrial scenarios.
5. Conclusions
This paper studied online zero-shot anomaly detection for industrial inspection, where target-domain training samples are unavailable during downstream deployment and online memory evolves sequentially from unlabeled test streams. We proposed TSMR, a lightweight extension of the reproduced RareCLIP framework, to improve the reliability of test-time memory evolution without modifying the backbone encoder or redesigning the anomaly scoring pipeline. The core idea of TSMR is to regulate whether the online memory should be refined for the current test frame, rather than applying memory updates uniformly to all incoming samples. Specifically, TSMR integrates a reference confidence quantile gate, a text-prior-based frame consensus gate, and weak-class selective activation to derive a frame-level memory-update decision. This design reduces unreliable memory refinement on visually unstable categories while preserving the original RareCLIP updating behavior for stable categories. Experiments on VisA and MVTec AD showed that TSMR achieves clear improvements on VisA and maintains competitive performance on MVTec AD. Object-wise analysis demonstrated that the benefit of TSMR mainly comes from selected challenging categories, while seed-wise comparison confirmed that the observed improvements remain stable under different online evaluation orders. Ablation studies further verified the importance of weak-class subset selection and showed that the method is robust to a range of patch gate quantiles. Cost analysis indicated that TSMR introduces negligible additional inference time within the evaluated memory range, although larger online memory mainly increases GPU memory consumption. Despite these improvements, the current design still relies on pilot validation to determine the weak-class set and gate hyperparameters, and its gains remain concentrated on selected categories. Future work will investigate adaptive weak-category identification, automatic threshold selection, more robust anomaly map refinement for subtle defects and ambiguous boundaries, and broader validation on stronger zero-shot anomaly detection baselines.