1. Introduction
Rapid urbanization and population growth have led to a substantial increase in municipal solid waste generation, posing significant challenges for sustainable waste management systems worldwide. Global projections estimate that annual municipal waste production will reach approximately 3.4 billion tons by 2050, placing increasing pressure on urban infrastructure, natural resources, and environmental sustainability. In addition, inefficient waste handling contributes to pollution and public health risks, particularly in densely populated urban environments.
Traditional waste management systems, which largely rely on manual monitoring, static scheduling, and limited data integration, are increasingly inadequate for addressing the scale and complexity of modern cities. These systems often lack real-time awareness of waste distribution and composition, leading to inefficient collection processes and suboptimal resource utilization [1,2,3].
To address these limitations, recent research has focused on intelligent waste management systems within smart city frameworks. These approaches leverage Internet of Things (IoT) technologies, sensor networks, and data-driven optimization techniques to improve collection efficiency, reduce operational costs, and enable dynamic decision-making [2,4,5,6,7]. However, most existing solutions primarily emphasize logistical optimization and provide limited capabilities for real-time waste identification and semantic understanding, which are essential for automation and informed decision-making.
In parallel, advances in artificial intelligence and computer vision have enabled automated waste detection and classification using deep learning models. Object detection architectures such as YOLO (You Only Look Once) have gained widespread adoption due to their ability to achieve high accuracy while maintaining real-time performance [8,9,10,11]. Recent variants, including YOLOv8 and YOLOv10, further improve detection robustness and efficiency in complex and dynamic environments. Despite these advances, most vision-based systems remain limited to predicting object labels and bounding boxes, without providing higher-level semantic interpretation or contextual reasoning.
More recently, the emergence of large language models (LLMs) and vision–language models (VLMs) has introduced new opportunities to bridge visual perception and semantic reasoning. Models such as GPT-4 and MiniGPT-4 enable the generation of natural language descriptions from visual inputs, facilitating richer interaction, interpretability, and knowledge extraction [12,13,14,15]. These multimodal systems extend beyond visual recognition by adding contextual understanding and semantic reasoning capabilities. Nevertheless, their integration into real-time waste management systems remains limited, particularly in unified frameworks that combine detection with semantic interpretation.
Despite these advancements, a critical gap persists: existing approaches tend to operate either at the level of logistics optimization or visual classification, without integrating semantic reasoning capabilities that support interpretability, recyclability assessment, and user-oriented explanations [14,16]. This limitation restricts the practical usability of intelligent waste management systems in real-world applications, where both accurate perception and meaningful communication are required.
To address this gap, this study proposes a multimodal vision–language framework for intelligent urban waste analysis. The proposed system integrates three complementary components within a unified pipeline:
- (i) Real-time object detection using YOLOv8m/YOLOv10m;
- (ii) Automated description generation through MiniGPT-4 [12,13];
- (iii) Structured semantic classification and category suggestion using GPT-5 Vision.
Unlike existing approaches, the proposed approach extends beyond detection by incorporating context-aware reasoning, enabling not only accurate localization and classification of waste items but also semantic explanation and recyclability assessment.
Furthermore, this work contributes a manually annotated dataset and introduces a comprehensive evaluation protocol based on fixed data splits and Stratified 5-Fold Cross-Validation, ensuring methodological rigor and reproducibility. By bridging computer vision with multimodal semantic reasoning, the proposed framework advances traditional detection systems toward more interactive, explainable, and scalable solutions for smart city applications.
To the best of our knowledge, only limited prior work has combined real-time waste detection, multimodal description generation, and structured semantic reasoning within a single unified pipeline. The proposed approach highlights the potential of combining perception and reasoning in intelligent systems designed for real-world deployment [14,16].
3. Materials and Methods
3.4. Implementation Details of GPT-5 Vision
GPT-5 Vision is incorporated as an auxiliary semantic analysis module rather than a primary detection component. This design choice is motivated by the need to complement the strengths of object detection models, which provide accurate localization and classification, with higher-level semantic interpretation and contextual reasoning.
This separation of roles avoids overloading a single model with multiple tasks and allows each component to operate within its strengths: YOLO for spatial detection and GPT-5 Vision for semantic enrichment and category suggestion. This modular design also improves interpretability and facilitates future extensibility of the system.
GPT-5 Vision was used via API calls (inference-only) as an auxiliary semantic classification and category-suggestion module, and no fine-tuning was performed. In the proposed multimodal pipeline, GPT-5 Vision does not replace the object detector; instead, it is used to semantically analyze the content of (i) YOLO object crops (one crop per predicted bounding box) and (ii) full-scene images in multiclass scenarios.
The objective is to improve semantic interpretability and reduce errors in visually similar categories by returning structured outputs that include the predicted waste category, a short description, and recyclability information. The experiments were conducted on the validation subset of the dataset using the GPT-5 Vision API under an inference-only setting.
Input preparation. After YOLO detection, each predicted bounding box was used to crop the corresponding region from the original image (RGB). For multiclass experiments, the entire image was provided to GPT-5 Vision to use global scene context. Prior to submission, images (crops or full scenes) were encoded in Base64 and embedded in the API request payload.
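The cropping and encoding steps above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: a nested list of pixel rows stands in for decoded RGB data (in practice a library such as OpenCV or Pillow would handle image I/O), and only the standard-library `base64` module is used.

```python
import base64

def crop_region(image, box):
    """Crop a bounding box (x1, y1, x2, y2) from an image stored as a
    list of pixel rows (stand-in for decoded RGB data)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def encode_b64(data: bytes) -> str:
    """Base64-encode raw image bytes for embedding in an API payload."""
    return base64.b64encode(data).decode("ascii")

# Example: crop a 2x2 region from a 4x4 single-channel "image".
image = [[r * 4 + c for c in range(4)] for r in range(4)]
crop = crop_region(image, (1, 1, 3, 3))
payload_image = encode_b64(b"raw-bytes")
```

In the actual pipeline, the box coordinates would come from the YOLO predictions and the bytes from the re-encoded crop.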
API request configuration. To reduce response variability and support reproducibility, we used a fixed inference configuration for all calls. Unless otherwise stated, we used low-temperature decoding (temperature = 0.1) to encourage deterministic, structured outputs and bounded response length (max_tokens = 256) to keep responses concise and consistently formatted.
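A request payload under this fixed configuration might be assembled as follows. The field names (`model`, `messages`, `image_url`) and the `gpt-5-vision` identifier follow a typical chat-completions-style API and are assumptions rather than the exact schema used in the study; only the `temperature` and `max_tokens` values are taken from the text.

```python
def build_request(image_b64: str, prompt: str) -> dict:
    """Assemble an inference request with a fixed, low-variability
    configuration. Field names are illustrative (provider-dependent)."""
    return {
        "model": "gpt-5-vision",   # placeholder model identifier
        "temperature": 0.1,        # low temperature for near-deterministic output
        "max_tokens": 256,         # bounded, consistently formatted responses
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }

req = build_request("QkFTRTY0", "Analyze the provided image ...")
```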
Structured prompting and output schema. The model was instructed to return JSON-only outputs (no markdown, no extra text) using a constrained set of categories. Enforcing a strict schema and validating outputs programmatically is aligned with recommended practices for reliable use of large language models in structured workflows [13]. The GPT-5 Vision schema comprises five of the six dataset classes used for YOLO training (Cardboard, Glass, Metal, Paper, Plastic; "Trash" is intentionally excluded, as noted in the listing captions) together with two auxiliary categories ("Organic" and "Other") that handle out-of-taxonomy objects, background clutter, or ambiguous materials not strictly belonging to the detector's training set. The prompt template and output schema used in this work are shown in Listing 1 and Listing 2, respectively.
Listing 1. GPT-5 Vision prompt template (single object/crop; JSON-only). Note: The GPT-5 schema intentionally excludes 'Trash'; see evaluation protocol for how 'Trash' ground-truth images are handled.

Analyze the provided image (object crop) and return ONLY valid JSON (no markdown, no extra text).
Choose category from: ["Plastic","Paper","Glass","Metal","Cardboard","Organic","Other"].
Return exactly this JSON object:
{
  "category": "<one of the allowed categories>",
  "description": "<short factual description (max 25 words)>",
  "recyclable": <true/false>
}
Rules:
- Output must be strictly valid JSON.
- Do not add any text outside the JSON.
Listing 2. GPT-5 Vision output schema (multiclass scene; JSON-only). Note: The GPT-5 schema intentionally excludes 'Trash'; see evaluation protocol for how 'Trash' ground-truth images are handled.

Analyze the full image and return ONLY valid JSON:
{
  "detections": [
    {
      "category": "<Plastic|Paper|Glass|Metal|Cardboard|Organic|Other>",
      "description": "<short factual description (max 25 words)>",
      "recyclable": <true/false>
    }
  ]
}
Rules:
- If no waste items are present, return "detections": []
- Output must be strictly valid JSON.
- Do not include explanations outside the JSON.
Programmatic validation and error handling. All responses were validated automatically. First, outputs were parsed using a JSON parser. Second, schema checks verified the presence and types of required keys (e.g., category, description, recyclable) and ensured that the category belonged to the allowed set. If parsing or schema validation failed, the request was reissued once with a stricter instruction emphasizing “JSON-only output.” Responses that remained invalid after the retry were logged and excluded from downstream analysis.
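The parse-validate-retry logic described above can be sketched as follows. This is a minimal sketch rather than the authors' code; in particular, `query_with_retry`'s `call(strict)` argument is a hypothetical stand-in for the actual API call, with `strict=True` denoting the reissued request that emphasizes "JSON-only output".

```python
import json

ALLOWED = {"Plastic", "Paper", "Glass", "Metal", "Cardboard", "Organic", "Other"}

def validate_response(text: str):
    """Parse and schema-check a single-object response.
    Returns the parsed dict, or None if parsing/validation fails."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(obj, dict):
        return None
    if not isinstance(obj.get("category"), str) or obj["category"] not in ALLOWED:
        return None
    if not isinstance(obj.get("description"), str):
        return None
    if not isinstance(obj.get("recyclable"), bool):
        return None
    return obj

def query_with_retry(call):
    """Issue a request; on validation failure, retry once with a stricter
    JSON-only instruction. A None result means: log and exclude."""
    result = validate_response(call(strict=False))
    if result is None:
        result = validate_response(call(strict=True))
    return result
```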
Inference procedure. For each input image, a structured inference pipeline was applied using GPT-5 Vision. Images were first encoded and submitted to the model via API calls, together with a predefined prompt that enforced a JSON-formatted response.
The prompt required the model to return three fields: (i) category (Plastic, Paper, Glass, Metal, Cardboard, Organic, Other), (ii) a short textual description of the object, and (iii) a recyclability indicator (true/false). The inclusion of “Organic” and “Other” categories enables the model to represent materials that do not directly map to the six-class detection taxonomy or correspond to organic waste types not explicitly annotated in the dataset.
Model responses were automatically validated through JSON parsing to ensure syntactic correctness. Valid outputs were stored and aggregated for further analysis.
To ensure consistency between detection and semantic evaluation, two related but distinct taxonomies were used in this study. The YOLO-based detection models operate on six categories (Plastic, Paper, Glass, Metal, Cardboard, Trash), while the GPT-5 Vision module uses an extended semantic taxonomy that includes “Organic” and “Other” to improve expressiveness and robustness.
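The correspondence between the two taxonomies can be made explicit as a lookup table. Mapping "Organic" and "Other" to `None` (i.e., outside the detector taxonomy) is an assumption consistent with their auxiliary role; whether they should instead fall back to "Trash" is a design choice the text does not fix.

```python
# Extended GPT-5 Vision taxonomy -> six-class detector taxonomy.
# "Organic" and "Other" have no detector counterpart and map to None
# (out of detector taxonomy) in this sketch.
GPT5_TO_YOLO = {
    "Plastic": "Plastic",
    "Paper": "Paper",
    "Glass": "Glass",
    "Metal": "Metal",
    "Cardboard": "Cardboard",
    "Organic": None,
    "Other": None,
}

def to_detector_class(gpt5_category: str):
    """Return the matching detector class, or None for auxiliary or
    unknown categories."""
    return GPT5_TO_YOLO.get(gpt5_category)
```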
Finally, a category distribution analysis was performed to examine the frequency of material types predicted by GPT-5 Vision.
Ground-truth definition and evaluation protocol. Ground-truth labels used for GPT-5 Vision evaluation were derived from the manually annotated YOLO validation dataset described in Section 3.1. Each validation image has one or more annotated object categories from the six-class waste taxonomy (Cardboard, Glass, Metal, Paper, Plastic, Trash).
GPT-5 Vision was evaluated at the image level. For single-object images, the predicted category returned by GPT-5 Vision was directly compared to the corresponding annotated ground-truth label. For images containing multiple annotated objects, the GPT-5 Vision output was evaluated using a relaxed image-level semantic agreement criterion: a prediction was counted as matched if at least one predicted category corresponded to at least one ground-truth category present in the image. This protocol was adopted because GPT-5 Vision was not used as a bounding-box detector in this study and did not aim to exhaustively recover all annotated instances in the scene. Accordingly, the resulting metrics should be interpreted as indicators of semantic alignment under a constrained image-level protocol, not as strict multi-label completeness or object-level detection accuracy.
True Positives (TP), False Positives (FP), and False Negatives (FN) were computed based on category agreement at the image level. Precision, Recall, and F1-score were then calculated as Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 × (Precision × Recall) / (Precision + Recall).
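The section states that TP/FP/FN are derived from category agreement at the image level but does not spell out the exact counting rules; the sketch below uses one plausible set-based convention (an assumption) and then applies the standard Precision/Recall/F1 formulas.

```python
def image_level_metrics(samples):
    """Compute Precision, Recall, and F1 from per-image category sets.
    `samples` is a list of (ground_truth_set, predicted_set) pairs.
    Counting convention (an assumption): per image, TP = |pred & gt|,
    FP = |pred - gt|, FN = |gt - pred|, summed over all images."""
    tp = fp = fn = 0
    for gt, pred in samples:
        tp += len(pred & gt)
        fp += len(pred - gt)
        fn += len(gt - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```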
Because GPT-5 Vision does not produce spatial bounding boxes in this protocol, IoU-based detection metrics such as mAP are not applicable.
It is important to emphasize that this protocol reflects image-level semantic agreement rather than strict multi-label completeness: for multi-object images, a prediction was considered matched if at least one ground-truth category was identified. This relaxed criterion may overestimate performance in complex multi-object scenes because it does not require the model to identify all annotated categories present in the image.
The reported GPT-5 Vision metrics should therefore be interpreted as semantic alignment indicators rather than exhaustive object-level verification performance. Within the proposed pipeline, GPT-5 Vision is positioned as an auxiliary semantic interpretation and category-suggestion module, rather than a strict verification mechanism. Future work will adopt stricter evaluation protocols, including multi-label completeness scoring and per-crop object-level evaluation using YOLO bounding boxes.
The proposed methodology follows a modular design that separates detection, description, and semantic reasoning into distinct components. This design improves system interpretability, enables independent evaluation of each module, and facilitates future extensions or component replacement without affecting the overall pipeline.
All implementation details related to model training, multimodal integration, and evaluation protocols are described in this section to ensure clarity and reproducibility. The Results section focuses exclusively on performance analysis and empirical findings.
4. Results
This section presents the experimental results of the proposed multimodal framework, including object detection performance, cross-validation analysis, and multimodal semantic evaluation. The analysis includes a comparative evaluation of the YOLOv8m and YOLOv10m object detection models, performance verification using Stratified K-Fold Cross-Validation, and the integration of Vision-Language Models (MiniGPT-4 and GPT-5 Vision) for semantic classification and descriptive generation. The experiments collectively assess model accuracy, performance stability, and the interpretability of results in both single-class and multi-class waste detection scenarios.
A comparative analysis is conducted on the visual and quantitative results of two models, YOLOv8m and YOLOv10m, trained on a dataset for object detection. The analysis is based on several key aspects of performance: the learning curve during training, the Mean Average Precision (mAP), precision/recall for individual classes, as well as metrics such as the effectiveness of inference and the training time.
From the first epoch, YOLOv8m shows consistently lower training loss than YOLOv10m, indicating faster learning. As can be seen in Figure 2, the total loss of YOLOv8m is lower, while YOLOv10m starts with a much higher loss value (close to 20). The loss curves gradually decrease and stabilize after approximately 40–50 epochs. The gap between training and validation loss remains relatively small for both models, suggesting limited overfitting and reasonable generalization. YOLOv8m converges faster and reaches a lower final loss (≈1.8 after 100 epochs) than YOLOv10m (≈3.8), indicating more efficient learning under the selected training configuration. Mean Average Precision (mAP) evaluates the overall accuracy of the model in object detection: mAP = (1/N) × Σ AP_i, where AP_i is the average precision of class i and N is the number of classes (for mAP@50–95, the average is additionally taken over IoU thresholds from 0.50 to 0.95).
As shown in Table 1, although the difference is small, YOLOv8m improves mAP@50 by approximately 1.0 percentage point (90.5% vs. 89.5%) and mAP@50–95 by approximately 1.1 percentage points (87.1% vs. 86.0%). Overall, YOLOv8m achieved slightly higher accuracy and efficiency under the evaluated setting.
The training dynamics of YOLOv8m and YOLOv10m, in terms of mAP progression across epochs, are shown in Figure 3.
To further analyze model behavior, class-wise precision and recall were examined. Precision measures the correctness of model predictions, whereas recall measures detection completeness.
Figure 4 clearly shows that YOLOv8m achieves higher precision and recall across almost all categories compared to YOLOv10m. In particular, YOLOv8m shows consistently higher recall across all categories, indicating fewer missed detections. Although the differences are moderate, they translate into improved reliability under real-world conditions.
Figure 4. Comparison of YOLOv8m and YOLOv10m via Confusion Matrices (Trash, Plastic, Paper, Metal, Glass, Cardboard). (a) YOLOv8m. (b) YOLOv10m.
The normalized confusion matrices in Figure 4 provide detailed insight into inter-class performance.
For YOLOv8m, the Glass and Paper classes exhibit the highest true positive rates. Plastic and Trash display more frequent confusion with the Background class.
YOLOv10m shows a slight decrease in performance for certain classes (e.g., Metal), but a marginal improvement for Plastic. Overall, both models demonstrate similar inter-class behavior patterns, although YOLOv8m maintains slightly stronger diagonal dominance across most categories.
As shown in Table 2, YOLOv8m achieves approximately 120 FPS with 8.3 ms latency per frame, compared to YOLOv10m's 105 FPS and 9.5 ms latency.
YOLOv8m achieves slightly higher accuracy and efficiency, making it suitable for real-time applications under the evaluated conditions.
The observed differences should be interpreted as dataset and hyperparameter-dependent rather than universally applicable.
Figure 5 and Figure 6 present the F1–Confidence curves. YOLOv8m achieves its best F1-score (0.87) at a confidence threshold of approximately 0.51, showing a strong balance between precision and recall. YOLOv10m reaches a lower maximum F1-score (0.82) at a higher threshold (approximately 0.58), suggesting greater sensitivity to confidence calibration.
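The threshold sweep behind an F1–Confidence curve can be sketched as follows, assuming a list of detections labeled as correct or incorrect against the ground truth; the function and its inputs are illustrative, not the evaluation code used in the study.

```python
def best_f1_threshold(detections, n_gt, thresholds):
    """Sweep confidence thresholds and return (threshold, F1) at the
    maximum F1. `detections` is a list of (confidence, is_true_positive)
    pairs; `n_gt` is the number of ground-truth objects."""
    best = (None, 0.0)
    for t in thresholds:
        kept = [ok for conf, ok in detections if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        precision = tp / (tp + fp) if kept else 0.0
        recall = tp / n_gt if n_gt else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best[1]:
            best = (t, f1)
    return best
```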
4.1. Quantitative Evaluation of MiniGPT-4 Descriptions
To complement the qualitative analysis, a structured quantitative evaluation of the generated descriptions was conducted. A randomly selected subset of 100 detected object samples from the validation set was manually assessed according to three predefined objective criteria:
Factual Consistency—The description accurately reflects the visual content of the detected object.
Category Alignment—The description is semantically consistent with the predicted waste category.
Recyclability Correctness—The recycling guidance provided is appropriate for the actual object type shown in the image.
Each criterion was evaluated using a binary scoring scheme (1 = correct, 0 = incorrect). The evaluation was performed manually according to predefined criteria in order to provide a consistent assessment of semantic accuracy and recycling guidance.
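Aggregating the binary scores into per-criterion percentages is straightforward; the key names used here (`factual`, `category`, `recyclable`) are illustrative rather than taken from the study's annotation sheet.

```python
def criterion_rates(scores):
    """Turn binary (1 = correct, 0 = incorrect) manual scores into
    per-criterion percentages over the evaluated samples."""
    n = len(scores)
    return {key: 100.0 * sum(s[key] for s in scores) / n
            for key in ("factual", "category", "recyclable")}

# Two hypothetical annotated samples:
rates = criterion_rates([
    {"factual": 1, "category": 1, "recyclable": 1},
    {"factual": 1, "category": 0, "recyclable": 1},
])
```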
The quantitative evaluation of the multimodal description module across factual consistency, category alignment, and recyclability correctness is presented in Table 3.
The results indicate that the description module demonstrates very strong visual grounding, with 99% factual consistency, confirming that the generated descriptions accurately reflect the visual content of the detected objects.
Category alignment reached 71%, suggesting that while most descriptions remain semantically related to the predicted waste category, certain discrepancies arise in cases of visually similar materials or ambiguous object presentations.
Recyclability correctness achieved 81%, indicating generally reliable recycling guidance. Minor inaccuracies were primarily associated with ambiguous material identification or generalized recycling assumptions.
Overall, the findings suggest that the multimodal description module provides visually grounded descriptions and generally accurate recycling-related guidance, while still leaving room for improvement in category-level semantic alignment.
Following object detection with YOLOv8, the second stage of the pipeline integrates a Large Language Model (LLM) to generate textual descriptions for each detected object.
MiniGPT-4 processes cropped object regions and optionally incorporates predicted class labels to generate context-aware descriptions. The process is fully automated, with each detected object passed to the model using a structured prompt for descriptive output generation.
These outputs support downstream tasks such as visualization, reporting, and user guidance. The evaluation includes both qualitative examples and quantitative analysis.
For illustration, representative outputs are presented for different material categories. For a metal object (e.g., bottle cap), the model generates descriptions that capture material properties and recycling recommendations. Similarly, for paper (envelope) and plastic (water bottle), the generated outputs emphasize material characteristics and provide general recycling guidance. These examples demonstrate the ability of the model to produce context-aware and informative descriptions across diverse waste categories.
Beyond object detection, the integration of YOLOv8m with MiniGPT-4 introduced an additional layer of functionality by enabling the generation of customized textual descriptions for each detected object. This pipeline transcends traditional detection by providing an educational and interactive approach that informs users about the characteristics of detected waste items while offering practical recycling guidelines. The integration underscores the potential of combining computer vision models with generative language models to enhance not only technical performance but also user experience.
4.2. Cross-Validation Analysis
To further assess the stability of YOLOv8m across alternative data partitions, Stratified 5-Fold Cross-Validation was performed. The quantitative results obtained for each fold are presented in Table 4; bold formatting distinguishes the average and standard-deviation rows from the individual fold rows to improve readability.
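In practice a library routine such as scikit-learn's StratifiedKFold would typically be used; the pure-Python sketch below illustrates the core idea: round-robin assignment within each class keeps class proportions approximately equal across folds.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, distributing each class
    round-robin so every fold approximates the overall class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# Toy dataset: 5 samples of each of two classes, split into 5 folds.
labels = ["Plastic"] * 5 + ["Glass"] * 5
folds = stratified_folds(labels, k=5)
```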
Despite the overall strong cross-validation performance, variability across folds was observed. While stratified sampling preserves class proportions, it does not account for differences in scene composition, object scale, or visual complexity within each partition. In particular, the lower performance observed in Fold 4 (mAP@50–95 ≈ 0.84) is plausibly associated with a higher proportion of visually ambiguous scenes, as suggested by the recall–confidence curves and the representative validation examples shown in Figure 7 and Figure 8. These factors increase detection difficulty and highlight dataset heterogeneity beyond simple class balance. The observed variance further justifies the use of cross-validation to obtain a more realistic estimate of performance stability across stratified data partitions.
To further analyze the reduced performance observed in Fold 4, both quantitative and qualitative evaluations were conducted.
First, class-level performance metrics show that the degradation is not uniform across categories. Plastic consistently exhibits high stability across confidence thresholds, while trash and cardboard exhibit earlier recall degradation in the Recall–Confidence curves. The per-class AP values show that trash and cardboard are the most affected categories.
The normalized confusion matrix further reveals increased background confusion for trash and cross-class ambiguity between cardboard and visually related categories.
Additionally, qualitative inspection of representative validation samples from Fold 4 shows the presence of partially visible objects, deformed cardboard structures, small-scale metal/glass fragments, and objects positioned against visually uniform backgrounds. These characteristics increase background confusion and inter-class similarity.
Importantly, training and validation loss curves remain stable and converge consistently, suggesting that the observed variability is not driven by optimization instability. Rather, the reduced Fold 4 performance reflects fold-level dataset heterogeneity in object scale and structural appearance, as demonstrated in Figure 7 and Figure 8.
These observations provide empirical support for the fold-specific performance variation observed in Table 4.
The obtained results indicate that YOLOv8m achieves strong average performance across folds (average Precision = 0.932 and mAP@50–95 = 0.9315), but the fold-wise results also reveal non-negligible variability (std: Precision ± 0.0516, Recall ± 0.0684, mAP@50 ± 0.0504, mAP@50–95 ± 0.0575). Folds 3–4 exhibit lower mAP@50–95 than Folds 0–2. Although stratified splitting preserves class proportions, it does not fully control for scene difficulty and intra-class variability (e.g., clutter, occlusions, reflections, and transparent materials), which may be unevenly distributed across folds. This observation motivates additional error analysis and the expansion of the dataset with more diverse in-the-wild urban scenes in future work. These findings support the usefulness of Stratified K-Fold Cross-Validation as a variance-aware evaluation procedure for the present dataset.
Figure 9 illustrates the evolution of the evaluation metrics during the training of the YOLOv8 model on one fold of the Stratified K-Fold Cross-Validation.
The upper section of the figure presents the training losses for the components box_loss, cls_loss, and dfl_loss, while the lower section displays the corresponding metrics for the validation data. A gradual and consistent decrease in loss can be observed, indicating good model convergence. Furthermore, the metrics Precision, Recall, mAP@50 and mAP@50–95 reach high values toward the end of training, reflecting stable performance and the absence of overfitting.
Figure 10 illustrates the relationship between Recall and the confidence threshold for each class (paper, glass, metal, plastic, trash, and cardboard).
The thick blue curve represents the overall performance across all classes, with a Recall value of approximately 0.99, indicating that the model successfully detects most objects with high reliability across varying confidence levels.
The individual-colored curves corresponding to each class show that the model performs more accurately for materials such as plastic, metal, and glass, while exhibiting slight variations for more challenging categories such as trash and cardboard.
The normalized confusion matrix in Figure 11 illustrates the percentage of correct detections for each class, where darker values indicate higher accuracy.
The YOLOv8 model achieved high accuracy for the classes paper (0.87), glass (0.89), and plastic (0.92), demonstrating strong capability in distinguishing recyclable materials. Slight misclassifications are observed between cardboard and trash, likely due to their visual similarity in the images.
Overall, the matrix suggests relatively consistent inter-class performance, with high Recall levels across most categories.
Vision–language models such as GPT-5 Vision combine image understanding with natural-language generation, integrating visual representations derived from images with semantic knowledge derived from language. This enables a deeper, context-aware interpretation of scene content.
In contrast to traditional object detection models such as YOLOv8 and YOLOv10, which focus solely on identifying and localizing objects within an image, GPT-5 possesses the ability to both visually analyze and semantically describe the detected objects. This allows for a richer and more interpretable form of classification that combines perception and reasoning.
To evaluate the potential of multimodal models in the task of visual classification, this study employed GPT-5 Vision as an auxiliary semantic classification and category-suggestion module following the detection phase performed by YOLOv8.
The GPT-5 model was used to directly analyze images and generate structured outputs in JSON format, including the waste category, a textual description, and an indication of whether the object is recyclable.
GPT-5 Vision was applied to the validation dataset to generate structured semantic outputs, including category labels, descriptions, and recyclability indicators. The analysis focused on evaluating the semantic consistency and interpretability of the generated outputs.
In total, 337 validation images were analyzed, all of which were successfully categorized by GPT-5 according to the primary material types. Table 5 presents a selection of the results obtained. For each detected object, GPT-5 provided additional semantic information, thereby enriching the classification process beyond the basic labeling performed by YOLOv8.
For example:
“A flattened cardboard packaging box”
“A clear glass jar of sliced peaches”
These results suggest that GPT-5 Vision can serve as a complementary semantic analysis component, providing structured category suggestions alongside visual recognition.
Its integration on top of YOLO’s detection outputs provides additional semantic interpretability and auxiliary category-level support in complex scenes.
4.3. Multiclass Image Analysis Using GPT-5 Vision
The objective of this experiment was to evaluate the ability of the GPT-5 Vision model to analyze images containing multiple objects from different waste categories (multiclass scenario).
In this setting, GPT-5 Vision was employed as a multiclass semantic analyzer and category-suggestion module, tasked with identifying visible objects, assigning categories, generating short descriptions, and determining recyclability. A total of 13 images were selected, each containing several distinct categories (e.g., plastic, metal, cardboard, paper, glass, organic, other). Each image was analyzed using the GPT-5 Vision API with a structured JSON prompt (see Listing 3).
Listing 3. Example GPT-5 Vision multiclass output (JSON format).

{
  "detections": [
    {"category": "Plastic", "description": "water bottle", "recyclable": true},
    {"category": "Metal", "description": "aluminum can", "recyclable": true}
  ]
}
The model was guided to:
- Identify each distinct object within the image;
- Assign the correct category from the list [Plastic, Paper, Glass, Metal, Cardboard, Organic, Other];
- Provide a short textual description;
- Indicate whether the object is recyclable or not.
The results obtained for each image were stored in a CSV file (multi_detection_results.csv) for subsequent statistical analysis and visualization.
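Storing the structured outputs can be done with the standard-library `csv` module. An in-memory stream is used here so the sketch is self-contained, and the column names are illustrative rather than the exact layout of multi_detection_results.csv.

```python
import csv
import io

FIELDS = ["image", "category", "description", "recyclable"]

def write_results(rows, stream):
    """Write one CSV row per detection (illustrative column layout)."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

buffer = io.StringIO()
write_results([
    {"image": "scene_01.jpg", "category": "Plastic",
     "description": "water bottle", "recyclable": True},
    {"image": "scene_01.jpg", "category": "Metal",
     "description": "aluminum can", "recyclable": True},
], buffer)
```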
From the 13 tested images, GPT-5 Vision successfully identified a total of 90 individual detections.
Figure 12 illustrates the distribution of categories detected by GPT-5 Vision in multiclass images.
The category distribution graph shows that Plastic (≈32 detections) and Metal (≈30 detections) are the most frequently identified categories, followed by Cardboard (≈18 detections) and Glass (≈6 detections), while Paper, Organic, and Other appear less frequently. It should be noted that the auxiliary categories Organic and Other appear only in the GPT-5 prompt schema and therefore represent semantic outputs that fall outside the six-class benchmark taxonomy used for detector training.
This distribution reflects the predominance of certain object types within the evaluated sample and illustrates GPT-5 Vision’s ability to generate structured multi-object outputs at the image level. However, these counts should be interpreted as indicators of semantic category presence rather than strict multi-label performance accuracy, since evaluation was conducted using an image-level agreement criterion.
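The per-category counts underlying a distribution figure of this kind can be derived directly from the detection CSV. The sketch below assumes the one-row-per-detection layout described above; the function name `category_counts` is illustrative.

```python
import csv
import io
from collections import Counter

def category_counts(csv_text):
    """Count detections per category from the results CSV content."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["category"] for row in reader)

# Small synthetic sample in the assumed CSV layout:
sample = (
    "image,category,description,recyclable\n"
    "a.jpg,Plastic,bottle,True\n"
    "a.jpg,Metal,can,True\n"
    "b.jpg,Plastic,cup,True\n"
)
counts = category_counts(sample)  # Counter({'Plastic': 2, 'Metal': 1})
```

The resulting `Counter` feeds straight into a bar chart of detections per category.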
In one representative multiclass example, GPT-5 Vision produced the following structured output (see Listing 4).
Listing 4. Example GPT-5 Vision structured output for a multiclass scene.
{
  "detections": [
    { "category": "Plastic", "description": "A transparent plastic cup", "recyclable": true },
    { "category": "Metal", "description": "A crushed aluminum can", "recyclable": true },
    { "category": "Cardboard", "description": "A flattened brown cardboard box", "recyclable": true }
  ]
}
GPT-5 Vision was evaluated using an image-level semantic agreement criterion. In multi-object images, predictions were considered matched if at least one ground-truth category was identified. This protocol does not require prediction of all annotated categories and therefore does not represent strict multi-label completeness scoring. Accordingly, the reported results should be interpreted as image-level semantic agreement rather than exhaustive multi-object recognition accuracy. Within the proposed framework, GPT-5 Vision is positioned as an auxiliary semantic classification and category-suggestion module rather than an object-level verification component. Future work will incorporate stricter object-level and multi-label evaluation protocols.
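The image-level agreement criterion can be stated precisely in a few lines of code. The function names are illustrative; the matching rule itself (at least one ground-truth category present among the predictions) is the one described above.

```python
def image_level_match(predicted, ground_truth):
    """True if at least one annotated category was predicted
    (image-level semantic agreement, not multi-label completeness)."""
    return bool(set(predicted) & set(ground_truth))

def agreement_rate(predictions, annotations):
    """Fraction of images whose predictions match at least one
    ground-truth category. Both arguments map image name -> category list."""
    matched = sum(
        1 for img, gts in annotations.items()
        if image_level_match(predictions.get(img, []), gts)
    )
    return matched / len(annotations)

# Toy example: a.jpg matches on Plastic; b.jpg misses its only category.
preds = {"a.jpg": ["Plastic"], "b.jpg": ["Paper"]}
gts = {"a.jpg": ["Plastic", "Glass"], "b.jpg": ["Metal"]}
rate = agreement_rate(preds, gts)  # 0.5
```

Note how a.jpg counts as matched even though Glass was never predicted, which is exactly why this protocol does not measure multi-label completeness.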
In this example, GPT-5 Vision successfully:
Distinguished three different categories within the same scene;
Provided accurate descriptions for each object;
Correctly assessed the recyclability status of each category.
This example illustrates the multimodal reasoning capability of GPT-5 Vision to interpret complex visual scenes and generate structured, semantically accurate responses.
Table 6 summarizes the category distribution obtained from GPT-5 Vision in multiclass image analysis.
GPT-5 Vision performs well in identifying common categories such as plastic, metal, and cardboard. Minor errors occur when objects overlap or exhibit strong reflections (e.g., transparent plastic near glass). The structured JSON output enables straightforward automated analysis and statistical visualization.
Metrics for GPT-5 Vision were computed against the six-class dataset ground truth (Plastic, Paper, Glass, Metal, Cardboard, Trash). Since ‘Trash’ is not included in the GPT-5 allowed output categories (Listings 1 and 2), GPT-5 predictions were evaluated only on the overlapping classes. Ground-truth images labeled as ‘Trash’ are therefore reported as non-overlapping/out-of-taxonomy cases for GPT-5 in this protocol.
It is important to note that GPT-5 Vision was evaluated as an image-level semantic classification and category-suggestion module rather than as a bounding-box detector. Ground-truth labels were obtained from the manually annotated validation set used for YOLO training. For single-object images, GPT-5 predictions were directly compared with the annotated label. For multi-object images, predictions were considered correct if they matched at least one annotated category in the image. The reported Precision, Recall, and F1-score reflect category-level agreement at the image level. IoU-based metrics such as mAP are not applicable since GPT-5 Vision does not generate spatial bounding-box outputs.
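One plausible formalization of the category-level Precision, Recall, and F1 computation on the overlapping classes is sketched below. The restriction to the five shared classes (excluding 'Trash', which is absent from the GPT-5 output schema) follows the protocol above; the set-based TP/FP/FN aggregation and the names `OVERLAP` and `image_level_prf` are assumptions, as the paper's exact aggregation may differ.

```python
# Classes shared by the GPT-5 schema and the six-class ground truth;
# 'Trash' is excluded as out-of-taxonomy for GPT-5.
OVERLAP = {"Plastic", "Paper", "Glass", "Metal", "Cardboard"}

def image_level_prf(predictions, annotations):
    """Category-level precision/recall/F1 over images, restricted to the
    overlapping classes. Both arguments map image name -> category list."""
    tp = fp = fn = 0
    for img, gts in annotations.items():
        gt_set = set(gts) & OVERLAP
        pred_set = set(predictions.get(img, ())) & OVERLAP
        tp += len(pred_set & gt_set)   # categories correctly reported
        fp += len(pred_set - gt_set)   # categories reported but not annotated
        fn += len(gt_set - pred_set)   # annotated categories that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: Plastic is a hit, Metal a false positive, Glass a miss.
precision, recall, f1 = image_level_prf(
    {"img1": ["Plastic", "Metal"]},
    {"img1": ["Plastic", "Glass"]},
)
```

Because no bounding boxes are involved, these scores quantify category agreement only; spatial metrics such as mAP remain undefined for this module.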
The comparison presented in Table 7 should be interpreted cautiously, as the two systems operate under different evaluation paradigms. YOLOv8 performs object-level detection with bounding boxes, whereas GPT-5 Vision was evaluated as an image-level semantic classification module. Therefore, the metrics reported for GPT-5 Vision reflect semantic category agreement rather than spatial detection performance. The purpose of this comparison is to illustrate the complementary semantic capabilities of multimodal vision–language models rather than to provide a direct benchmark against object detection systems.
Because the two systems were evaluated under different protocols, the values in Table 7 should be interpreted descriptively rather than as a direct benchmark comparison. These results indicate that GPT-5 Vision can function as a complementary semantic analysis component for multicategory images, providing structured category suggestions alongside traditional object detection approaches.