1. Introduction
In recent years, the phenomenon of lonely death has emerged as a serious social issue in aging societies, particularly in Japan and other countries facing rapid demographic change [1,2,3,4,5]. Elderly individuals living alone are at risk of passing away unnoticed, and in many cases, their bodies remain undiscovered for extended periods. This creates not only emotional and social challenges but also practical burdens for families, communities, and professional cleaning staff. The task of cleaning such environments—often contaminated with biological matter, mold, and other hazardous residues—requires specialized skills and exposes workers to significant psychological and physical stress. Reducing this human burden has therefore become an important goal at the intersection of social welfare and technological innovation.
One promising direction is the integration of robotics and artificial intelligence for autonomous environmental monitoring and cleaning. Robotic vacuum cleaners, once designed primarily for simple household convenience, are increasingly envisioned as part of a broader health and safety infrastructure [6,7,8,9,10]. If robots could not only perform cleaning but also assist in the early detection of hazardous conditions—such as clutter accumulation, mold growth, or uninhabitable indoor environments—they could both reduce occupational strain for professionals and contribute to safer living conditions for vulnerable populations. To achieve this, however, robots must acquire more advanced perceptual capabilities, enabling them to distinguish between diverse and contaminated indoor scenes in real-world conditions. In computer vision, significant progress has been made in object recognition [11,12,13], semantic segmentation, and floor-surface detection [14,15,16,17], all of which are essential components for autonomous navigation and cleaning. Research on object recognition has benefited from the rapid development of deep neural networks, allowing systems to identify furniture, debris, and personal items with increasing accuracy. Similarly, floor recognition—the ability to detect and isolate traversable areas in cluttered or low-quality images—has become a central focus, as robots must operate reliably even in challenging environments. However, while these advances provide a technical foundation, the classification of room-level conditions remains a challenging problem. In particular, distinguishing between “normal,” “cluttered,” and “hazardous” indoor environments requires not just visual cues but also semantic understanding.
Traditional approaches to scene classification have relied on raw image embeddings or handcrafted features, but these methods often fail in situations where images contain contamination, occlusion, or unusual room types. Recent progress in vision–language models (VLMs), particularly OpenAI’s CLIP, has introduced new opportunities for semantic representation. By aligning image and text embeddings in a shared feature space, CLIP enables image classification to be guided by natural language concepts, which are more interpretable and flexible than purely visual features. This capability is particularly well-suited for scenarios involving diverse and non-standard room conditions, as it allows researchers to define archetypes—for example, “clean room,” “cluttered room with dry debris,” or “moldy bathroom”—and evaluate images relative to these semantic anchors.
At the same time, scene classification for contaminated environments is not sufficient without accurate segmentation. For robotic cleaning to be effective, floor areas must be detected precisely, even under difficult imaging conditions such as shadows, stains, or biological contamination. Recently, segmentation models such as the Segment Anything Model (SAM) and DeepLabV3 have shown remarkable generalization across domains. However, no single model consistently performs well across all conditions, particularly when confronted with degraded or low-quality images. This motivates the development of hybrid segmentation pipelines that combine the strengths of multiple models along with domain-specific pre-processing strategies.
Taken together, these developments point toward a new paradigm for indoor scene understanding: one that combines semantic clustering with robust segmentation to provide interpretable, generalizable, and task-relevant representations. In this study, we propose a CLIP-Guided Clustering framework that leverages archetype-based similarity to improve scene classification and a hybrid segmentation pipeline to enhance floor region extraction. Our contributions are twofold: (1) introducing an interpretable semantic feature space defined by archetype similarity scores, which enables robust unsupervised clustering of indoor scenes, and (2) designing a hybrid segmentation approach that integrates SAM, DeepLabV3, and tailored pre-processing to achieve reliable floor recognition even in contaminated or cluttered environments.
By addressing both semantic classification and segmentation, our approach aims to bridge the gap between advances in computer vision and their application in socially significant contexts, such as reducing the burden of professional cleaners in lonely death cases and supporting the development of next-generation robotic cleaning systems. Beyond this specific application, we argue that combining vision–language models with hybrid segmentation pipelines offers a generalizable strategy for robust and interpretable indoor scene understanding.
2. Materials and Methods
2.1. Dataset
The dataset utilized in this study was provided by Bright Future Support Company (Sendai, Japan) and comprises 141 images of indoor residential environments. The images cover a variety of room types and conditions, including clean and cluttered living spaces, kitchens, bathrooms, and rooms with human activity. Initially, all images were automatically classified using the CLIP model to determine the basic scene type. The distribution was as follows: living room or tatami room (N = 69), bedroom (N = 31), kitchen (N = 17), bathroom, toilet, or washroom (N = 14), entryway, hallway, or corridor (N = 8), and close-up or other images (N = 2). Approximately 48.9% of the dataset consisted of living rooms or tatami rooms, indicating a reasonable coverage of typical residential environments without excessive bias toward any single scene type. This diversity ensures that subsequent analyses address realistic scenarios for indoor cleaning and environmental monitoring. The dataset used in the experiment can be obtained for research purposes only by contacting the data owner, Bright Future Support Company (bfssendai@yahoo.co.jp). This study did not augment the dataset using techniques such as geometric transformations or color transformations.
All images were pre-processed to standardize size and color balance, minimizing the influence of lighting variations and image resolution on downstream analysis. Pre-processing included resizing to 224 × 224 pixels, normalization to CLIP’s expected input distribution, and application of contrast adjustment where necessary. This preparation ensured that embeddings generated by the CLIP model would be consistent across the dataset. The aim of the dataset preparation was not only to represent diverse residential spaces but also to support semantic classification relevant to cleaning strategies, particularly considering the presence of contamination or clutter, which is critical for applications such as robotic cleaning or post-mortem environment management. All experiments were implemented in Python 3.7 using PyTorch 2.1 and the OpenCLIP library. The UMAP and clustering analyses were performed using scikit-learn 1.4. Experiments were run on a workstation equipped with an NVIDIA RTX 4090 GPU (24 GB VRAM), Intel Core i9-13900K CPU, and 64 GB RAM (Santa Clara, CA, USA). Each training and evaluation cycle required approximately 2.5 h (Figure 1).
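A minimal sketch of this pre-processing step is given below. It assumes the normalization constants published for the original CLIP models and a Pillow-based contrast adjustment; the exact values and tools used in our experiments may differ.

# Pre-processing sketch (assumed parameters; the study's exact settings may differ).
from PIL import Image, ImageEnhance
from torchvision import transforms

# Normalization constants published for the original CLIP models.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(CLIP_MEAN, CLIP_STD),
])

def load_image(path, contrast_factor=1.0):
    """Load an image, optionally adjust contrast, and return a CLIP-ready tensor."""
    img = Image.open(path).convert("RGB")
    if contrast_factor != 1.0:  # applied only where necessary
        img = ImageEnhance.Contrast(img).enhance(contrast_factor)
    return preprocess(img)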
2.2. Analysis Methods
2.2.1. CLIP-Guided Clustering
To classify indoor scenes based on their semantic and contamination characteristics, we developed a novel approach termed CLIP-Guided Clustering. This approach builds on foundational research in vision–language models, indoor scene classification, and semantic segmentation [18,19,20]. Specifically, the proposed CLIP-Guided Clustering framework is designed to advance this field by effectively extracting and highlighting archetype-based semantic similarities. Unlike conventional clustering approaches that rely on raw image embeddings, this method uses similarity scores between images and pre-defined textual archetypes as feature vectors. This approach provides interpretable semantic representations and allows clustering to focus on attributes relevant to cleaning and human safety.
We defined four archetypal room types with textual prompts:
Prompt A: “a photo of a clean room with few items” (representing organized and uncluttered rooms).
Prompt B: “a photo of a cluttered room full of papers, boxes, and dry debris” (representing dry cluttered rooms).
Prompt C: “a photo of a dirty and moldy bathroom with filth” (representing wet, contaminated rooms).
Prompt D: “a photo of a room with people working in it” (representing occupied or active rooms).
For each of the 141 images, embeddings were generated using CLIP, and cosine similarity scores were computed between each image embedding and the four textual archetypes (Figure 2). This produced a four-dimensional feature vector for each image, encoding its semantic proximity to each archetype. For example, an image might have a similarity vector such as [0.1, 0.8, 0.1, 0.0], indicating strong alignment with the cluttered room archetype.
These vectors were then used as inputs for K-Means clustering. Clustering was aimed at grouping rooms according to cleaning-relevant characteristics rather than visual similarity alone, facilitating identification of areas requiring different cleaning strategies or levels of attention.
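A minimal sketch of this feature-extraction and clustering step is shown below, using the OpenCLIP library noted in Section 2.1. The ViT-B-32 backbone, the "openai" pretrained weights, and k = 4 clusters are illustrative assumptions rather than a definitive record of our configuration.

# Sketch of archetype-based feature extraction and K-Means clustering
# (backbone, pretrained weights, and k are assumptions for illustration).
import numpy as np
import torch
import open_clip
from sklearn.cluster import KMeans

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

ARCHETYPES = [
    "a photo of a clean room with few items",                             # Prompt A
    "a photo of a cluttered room full of papers, boxes, and dry debris",  # Prompt B
    "a photo of a dirty and moldy bathroom with filth",                   # Prompt C
    "a photo of a room with people working in it",                        # Prompt D
]

with torch.no_grad():
    text_feat = model.encode_text(tokenizer(ARCHETYPES).to(device))
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

def archetype_vector(pil_image):
    """Return the 4-D cosine-similarity vector between one image and the four archetypes."""
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(pil_image).unsqueeze(0).to(device))
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        return (img_feat @ text_feat.T).squeeze(0).cpu().numpy()

def cluster_rooms(pil_images, n_clusters=4):
    """Stack archetype vectors for all images and group them with K-Means."""
    features = np.stack([archetype_vector(img) for img in pil_images])
    labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(features)
    return features, labels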
2.2.2. Dimensionality Reduction and Clustering Validation
To visualize and evaluate clustering results, Uniform Manifold Approximation and Projection (UMAP) was applied to reduce the four-dimensional similarity vectors to two dimensions. We systematically tuned UMAP parameters (n_neighbors and min_dist) to assess robustness and optimize cluster separability. Sixteen combinations of these parameters were tested, and cluster quality was quantified using the silhouette coefficient. Higher silhouette scores indicate more distinct and well-separated clusters. The optimal configuration was identified as n_neighbors = 5 and min_dist = 0.0, yielding a silhouette score of 0.631. This analysis confirmed that archetype-based features improved clustering performance compared to using raw image embeddings directly, and that the method was robust to parameter selection.
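The parameter sweep can be summarized by the following sketch. The 4 × 4 grid shown is an assumed example of the sixteen tested combinations, and here the silhouette coefficient is computed on the two-dimensional UMAP embedding, which may differ in detail from the exact evaluation protocol.

# Sketch of the UMAP parameter sweep with silhouette-based validation
# (the grid values are assumptions; the study reports n_neighbors = 5, min_dist = 0.0 as optimal).
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sweep_umap(features, n_clusters=4):
    best = None
    for n_neighbors in (5, 10, 15, 30):          # illustrative grid of 4 values
        for min_dist in (0.0, 0.1, 0.25, 0.5):   # 4 x 4 = 16 combinations
            emb = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                            n_components=2, random_state=42).fit_transform(features)
            labels = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit_predict(emb)
            score = silhouette_score(emb, labels)
            if best is None or score > best[0]:
                best = (score, n_neighbors, min_dist)
    return best  # (best silhouette score, n_neighbors, min_dist)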
2.2.3. Hybrid Segmentation (Optional Integration)
Although the primary focus of this study is semantic clustering, accurate floor and object segmentation is essential for downstream applications such as robotic cleaning or environmental monitoring. To address this, we developed a hybrid segmentation pipeline combining the Segment Anything Model (SAM) and DeepLabV3. Pre-processing included noise reduction, contrast enhancement, and region-of-interest extraction to handle low-quality or cluttered images. SAM generated initial object masks, capturing a wide range of objects, while DeepLabV3 focused on semantic segmentation of floors and traversable areas. The outputs were fused using morphological operations and conditional masking to accurately delineate floor regions, even in images with shadows, stains, or biological contamination. Segmentation performance was evaluated using standard metrics including Intersection over Union (IoU) and pixel-level accuracy.
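A simplified sketch of how such a fusion might be implemented is given below. The checkpoint path, the use of the DeepLabV3 background class as a floor prior, and the overlap and morphology parameters are illustrative assumptions rather than the exact fusion rules of our pipeline.

# Simplified sketch of the SAM + DeepLabV3 fusion step
# (checkpoint path, floor-prior heuristic, and morphology parameters are assumptions).
import cv2
import numpy as np
import torch
from torchvision.models.segmentation import deeplabv3_resnet101, DeepLabV3_ResNet101_Weights
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # hypothetical checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam)

weights = DeepLabV3_ResNet101_Weights.DEFAULT
deeplab = deeplabv3_resnet101(weights=weights).eval()

def floor_mask(image_bgr):
    """Fuse SAM object proposals with a DeepLabV3-derived floor prior."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

    # 1. DeepLabV3: treat pixels not assigned to any foreground class as a floor/background prior.
    inp = weights.transforms()(torch.from_numpy(rgb).permute(2, 0, 1))
    with torch.no_grad():
        pred = deeplab(inp.unsqueeze(0))["out"].argmax(1).squeeze(0).numpy()
    floor_prior = (pred == 0).astype(np.uint8)  # class 0 = background in the pretrained label set
    floor_prior = cv2.resize(floor_prior, (rgb.shape[1], rgb.shape[0]),
                             interpolation=cv2.INTER_NEAREST)

    # 2. SAM: keep large proposals that mostly overlap the floor prior.
    fused = np.zeros_like(floor_prior)
    for m in mask_generator.generate(rgb):
        seg = m["segmentation"].astype(np.uint8)
        overlap = (seg & floor_prior).sum() / max(seg.sum(), 1)
        if overlap > 0.5 and seg.sum() > 0.02 * seg.size:
            fused |= seg

    # 3. Morphological cleanup of the fused floor region.
    kernel = np.ones((7, 7), np.uint8)
    return cv2.morphologyEx(fused, cv2.MORPH_CLOSE, kernel)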
2.2.4. Implementation and Reproducibility
All analyses were implemented using PyTorch (v2.1) for CLIP embeddings, scikit-learn for K-Means clustering, and UMAP for dimensionality reduction. Hyperparameters for clustering and UMAP were systematically tuned and documented. All image data were anonymized, and ethical guidelines for handling potentially sensitive images were followed strictly.
By integrating semantic clustering based on archetype similarity with robust hybrid segmentation, this framework provides interpretable and generalizable representations of indoor scenes. It enables differentiation between clean, cluttered, contaminated, and occupied rooms.
3. Results
To objectively evaluate the parameter settings of UMAP for clustering optimization, we systematically examined 16 combinations of n_neighbors and min_dist. The clustering quality for each parameter set was quantified using the silhouette coefficient, where values closer to 1 indicate better-defined clusters. Among all tested combinations, the parameter setting of n_neighbors = 5 and min_dist = 0.0 yielded the highest silhouette score of 0.631, indicating that this configuration provided the most distinct and well-separated clustering structure (Figure 3, Table 1).
4. Discussion
In this study, we introduced a novel method termed CLIP-Guided Clustering, which enables the classification of indoor environments based on the type of contamination present—a feature directly relevant to the design of cleaning strategies. Unlike conventional approaches that rely solely on raw image features, our method leverages similarity scores to predefined room archetypes as new feature representations. This design allows for the semantic grouping of rooms in a manner that reflects real-world cleaning difficulty and robotic navigation patterns.
Through systematic evaluation, including UMAP parameter tuning with silhouette score optimization, we confirmed that the obtained clusters were not only interpretable but also objectively validated. The analysis demonstrated that appropriate parameter selection (e.g., n_neighbors = 5, min_dist = 0.0) is critical to achieving robust and well-separated clusters, underscoring the methodological rigor of our approach. Furthermore, quantitative evaluation revealed that the proposed CLIP-Guided Clustering method outperformed the conventional CLIP embedding baseline, achieving the highest average silhouette score across all tested configurations. This improvement demonstrates that incorporating archetype-based semantic similarity leads to more coherent and meaningful cluster formation than relying on raw visual embeddings alone. In addition to improved clustering performance, the proposed approach offers a clear advantage in interpretability. Because each cluster is characterized by explicit textual archetypes (e.g., “cluttered room” or “moldy bathroom”), the resulting groupings can be intuitively understood in terms of environmental quality and cleaning context. This interpretable structure enables both researchers and practitioners to associate visual patterns with actionable cleaning or maintenance strategies. These findings provide further quantitative evidence of the robustness and novelty of the proposed method. By uniting vision–language representations with clustering optimization, CLIP-Guided Clustering establishes a semantically grounded and empirically validated framework for indoor scene understanding.
In addition to CLIP-Guided Clustering, we developed a hybrid segmentation pipeline combining SAM and DeepLabV3, which facilitated accurate floor-region extraction even in challenging image conditions. While some images remained difficult to segment due to resolution and noise constraints, our results indicate that carefully curated training data combined with SAM substantially improves the robustness of downstream classification tasks. Importantly, this pipeline was found to outperform traditional approaches such as RGB mask–based segmentation with OpenCV, which failed to capture meaningful boundaries.
Overall, the proposed framework represents a proof of concept that integrates semantic clustering and hybrid segmentation to support context-aware cleaning strategies. Beyond cleaning robots, such methods hold promise for broader applications, including remote work assistance, autonomous indoor navigation, and human–robot interaction in complex living environments. Future work should focus on refining segmentation accuracy for low-quality inputs and extending validation to real-world deployment scenarios.
In addition to the hybrid SAM–DeepLabV3 segmentation pipeline, we also investigated several conventional segmentation approaches to benchmark performance under diverse indoor conditions. Threshold-based segmentation (Otsu’s method) relies on intensity histogram analysis to automatically determine an optimal threshold separating foreground and background regions. While computationally efficient, this method assumes bimodal intensity distributions and tends to fail in cluttered or unevenly illuminated environments, where object boundaries are not well defined or contrast is low [21,22,23,24,25]. Canny-based or edge-detection segmentation identifies contours by locating gradients in pixel intensity. Although effective in detecting clear structural boundaries, it is highly sensitive to noise, surface texture, and lighting variations. Consequently, indoor scenes with complex floor patterns, shadows, or reflective materials often yield fragmented or incomplete segmentation results [26,27,28,29,30]. RGB mask–based segmentation (e.g., via OpenCV color thresholding) performs region extraction by constraining specific color channels. This approach is simple to implement but lacks robustness to color variations caused by dirt, stains, or illumination differences across images, frequently leading to over- or under-segmentation [31,32,33,34,35].
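For reference, the three conventional baselines described above can be sketched with OpenCV as follows; the threshold and color-range values are placeholders rather than the settings used in our benchmark.

# Illustrative OpenCV baselines compared against the hybrid pipeline
# (threshold and color-range values are placeholders, not the study's exact settings).
import cv2
import numpy as np

def otsu_segmentation(image_bgr):
    """Otsu thresholding: assumes a roughly bimodal intensity histogram."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask

def canny_segmentation(image_bgr, low=50, high=150):
    """Edge-based segmentation: closes Canny edges to form candidate regions."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)

def rgb_mask_segmentation(image_bgr, lower=(60, 40, 40), upper=(200, 180, 170)):
    """Color-range masking: extracts pixels inside a fixed BGR range."""
    return cv2.inRange(image_bgr, np.array(lower, np.uint8), np.array(upper, np.uint8))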
In contrast to these traditional methods—which tend to degrade in cluttered or low-light environments—our hybrid segmentation approach demonstrates greater robustness by leveraging semantic priors from vision–language models and edge consistency from SAM-based structural cues. This combination enables more stable and context-aware region extraction, even under challenging visual conditions, thereby validating the effectiveness of integrating semantic and geometric information in indoor scene analysis.
5. Conclusions
In this study, we introduced a novel method termed CLIP-Guided Clustering, which enables semantically meaningful categorization of indoor environments by leveraging similarity scores between images and pre-defined textual archetypes. Unlike conventional feature-based clustering, our approach encodes each image in terms of its resemblance to conceptual prototypes such as clean rooms, cluttered dry-debris environments, wet and mold-contaminated bathrooms, and rooms with human activity.
A dataset of 141 household images was analyzed, comprising diverse indoor scenes with a moderate predominance of living room or tatami-room settings (48.9%).
To ensure objectivity and reproducibility, clustering quality was systematically evaluated through UMAP parameter tuning. Sixteen parameter combinations were tested, and the optimal setting (n_neighbors = 5, min_dist = 0.0) achieved the highest silhouette score of 0.631. This empirical evidence highlights the critical role of dimensionality reduction parameters in enhancing cluster separability and demonstrates the methodological robustness of our framework.
Beyond clustering, we further integrated semantic segmentation methods, including the Segment Anything Model (SAM) and DeepLabV3, into a hybrid pipeline for precise floor-region extraction. This step is essential for downstream robotic navigation and task execution.