1. Introduction and Related Work
The study of sidewalks is crucial for accessibility and quality of life in cities [1, 2, 3, 4]. Because the required level of detail is not available in satellite imagery and in situ surveys are costly, mapping such features is challenging. Street-level imagery is therefore a promising alternative. However, defining the features to be classified in street-level imagery is itself difficult because landscape features are conceptualized in natural language. Large language models (LLMs) can thus improve image pattern retrieval from street-level imagery by allowing flexible, natural-language prompts. This characteristic makes the extraction of map features more intuitive and accurate by exploiting the language-understanding capabilities of LLMs, which can be applied in many contexts, such as urban planning and transport studies; in [5], for example, an LLM was used to solve a community-level land-use task through feedback iteration. Recently, there has been considerable interest in the accessibility of these models, as demonstrated by ChatGPT, which serves as an interface for user interaction with the latest iterations of the Generative Pre-trained Transformer (GPT). This development is the most recent in a continuing series of advances in natural language processing (NLP). The study of NLP can be traced back to the 1950s [6], with early developments culminating in the establishment of the essential subtasks, such as sentence boundary detection, tokenization, part-of-speech assignment to individual words, morphological decomposition, chunking, and problem-specific segmentation [6]. Later developments resulted in the tailoring of conventional algorithms, such as support vector machines and hidden Markov models.
Among these developments, the extension of open-vocabulary prompting to image pattern retrieval represents a substantial leap forward in NLP, driven by advances in LLMs. This approach has recently gained attention for its richer expressiveness and greater flexibility, which stem from its generalization capabilities [7], its ability to capture nuances of processes [8], and its enabling of transfer learning [9]. The work of Zareian et al. [10] claims to be the first to pose the problem of “Open Vocabulary Object Detection”, in contrast to other similar approaches. One such alternative is “Zero-Shot Detection”, which aims to generalize from seen to unseen classes using semantic descriptions or attributes [11, 12, 13]. Another is weakly supervised detection, which focuses on detecting classes with limited information, such as image-level or noisy labels [14, 15]. It is worth noting that open-vocabulary approaches have some notable drawbacks, such as a higher computational cost due to their greater complexity [7] and the risk of misinterpretation due to the lack of constraints [8].
This paper aims to evaluate open-vocabulary algorithms within a domain-specific pipeline for pathway segmentation and surface material classification. To achieve these objectives, we designed a two-stage pipeline that establishes a workflow for large-scale, georeferenced pathway surface classification. In Experiment 1, we segment pathways in street-level imagery, while in Experiment 2, we classify segmented pathways by surface material type. Performance is evaluated regarding the pipeline’s accuracy, adaptability to various pathway types, and efficiency in processing large-scale georeferenced images, aiming to develop a robust workflow for urban infrastructure mapping. Identification of pavement material types is crucial for maintaining road safety and ensuring the well-being of people [16]. This process is vital for mobility studies as it influences safety, skid resistance, and road noise [17].
Particularly for sidewalks, identifying suitable pavement types is essential for improving accessibility, urban mobility, and safety for all users, including those with reduced mobility [18]. Additionally, using specific pavement types, like exposed aggregate concrete (EAC) and porous concrete, can significantly reduce noise and enhance safety on both roadways and sidewalks [19]. Zeng & Boehm [20] carried out a broader investigation of open-vocabulary algorithms, which obtained reasonably good results on average compared to state-of-the-art (SOTA) closed-vocabulary ones, while observing that accuracy depends on the prompt.
Although “street” and “sidewalk” are common classes in many classification and segmentation datasets [21, 22, 23, 24], and serve as reference classes for testing many proposed algorithms [24, 25, 26], pavement type identification is a far less commonly undertaken task. Convolutional neural networks have been used to identify a few classes, limited to asphalt, gravel, and cement [27]; to identify asphalt damage traits, such as “pothole” or “patch” [28]; to assess pavement damage [29]; and to perform pixel-level segmentation, limited to “paved” and “unpaved” streets [30]. Only recently have eight specific classes, such as “granite blocks” and “hexagonal asphalt paver”, been included [31]. Each of these studies developed specific models with a closed set of classes, which limits wider use of the algorithms, in contrast to open-vocabulary counterparts, such as CLIP [32] and Grounding Dino [33], which are designed to be generalist algorithms.
Furthermore, none of these previous studies on pavement material identification follow a global standard. This work innovates by integrating its application into OpenStreetMap (OSM), the world’s most popular open and collaborative geographic database [34]. The feature attribute standards created by the OSM community have typically been agreed upon over many years, including surface materials, although they are often not defined by domain experts. Adopting this strategy helps maintain interoperability between the developments of this research and other tools and applications linked to OSM. In this context, this work aims to test the behaviour of some state-of-the-art algorithms for open-vocabulary image classification tasks in identifying pavement types according to the OSM conceptual model, thus aiming to help design valuable building blocks for producing compliant geographic data.
This study presents a valuable proposition to the community of urban planners and geospatial analysts. By using open-vocabulary algorithms and large language models (LLMs) to segment and identify pavement materials in street-level imagery, this work advances the methodology for mapping urban features. It bridges the gap between natural language processing and geospatial data extraction, providing a more intuitive and granular approach to feature classification. By integrating with OpenStreetMap (OSM), the study improves data interoperability and sets a precedent for using open and collaborative platforms for urban data management. We also contribute to creating building blocks for producing compliant geodata by providing open resources, such as the Deep Pavements dataset and a fine-tuned CLIP-based model. These innovations contribute to more accessible, scalable, and flexible solutions for urban infrastructure analysis, with potential applications in transport planning, accessibility improvement, and public safety.
2. Methodology
The main challenge in using NLP to extract features from images is the process called “grounding”, that is, the capability of the model to associate elements of language with the proper regions in the images [35, 36, 37]. It has been shown that accounting for the hierarchical relationships between concepts, by establishing ontologies, improves the grounding level [38, 39, 40]. Ontologies bridge the gap between natural language and logical reasoning by providing a machine-readable representation (class) of real-world concepts [41]. Our study’s methodological approach is presented in Figure 1, which includes the main steps and processes to fulfil our study’s goals.
The motivating task behind this study can be formalized as follows: “Given a set of street view images, the objective is to segment all visible paths in each image using free-text input algorithms, identify their surface materials, and ensure that the process is not affected by the use of synonyms in the input”. Considering the hierarchical relationships between the entities of interest and their properties, an ontology was created using the Web Ontology Language (OWL) standard [42]. This ontology is presented as a graph in Figure 2. The hierarchy starts with “pathway”, which is then specialized into “Road” and “Footway” categories. The branching continues down to the two main categories of interest: “Road” and “Sidewalk”. Although these categories differ from one another, they fundamentally share the characteristic of being walkable, which is made possible by having a surface material, here treated as a fundamental shared characteristic of all “pathways”. “Surface material” in turn has its own sub-ontology, composed of two main categories, namely “unpaved material” and “paved material”, the latter being subdivided into “continuous” and “block-based”. This property sharing is expected to be visible in the results of any algorithm classifying bitmaps among these categories. With the ontology in place, it is possible to establish different degrees of semantic separation among classes, particularly among pavement surfaces. For example, “asphalt” is more similar to “concrete” (0 degrees, same hierarchical level) than it is to “sett” (1 degree) and “compacted” pavement (2 degrees). This ontology is fundamental to the proposed experiments, and its hierarchical relationships are referred to throughout the study. To the best of our knowledge, no similar ontology has been proposed before.
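The degrees of semantic separation described above can be illustrated with a small sketch. The parent map below is a hypothetical encoding of the surface-material branch only, and the placement of “sett” under “block-based” and of “compacted” under “unpaved material” is an assumption inferred from the degrees quoted in the text, not a copy of the published ontology. The degree is formalized here, also as an assumption, as the number of levels that must be climbed above the deeper class’s parent to reach a shared ancestor.

```python
# Hypothetical encoding of the surface-material branch of the ontology.
# Each key is a class and each value is its parent (None marks the root).
PARENT = {
    "surface material": None,
    "unpaved material": "surface material",
    "paved material": "surface material",
    "continuous": "paved material",
    "block-based": "paved material",
    "asphalt": "continuous",
    "concrete": "continuous",
    "sett": "block-based",            # assumed placement
    "compacted": "unpaved material",  # assumed placement
}


def ancestors(node):
    """Return the chain from the node's parent up to the root."""
    chain = []
    parent = PARENT[node]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def separation_degree(a, b):
    """Levels climbed above the deeper class's parent to reach a shared ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    # First ancestor of `a` that is also an ancestor of `b` (lowest common ancestor).
    common = next(n for n in chain_a if n in chain_b)
    return max(chain_a.index(common), chain_b.index(common))


for other in ("concrete", "sett", "compacted"):
    print("asphalt vs", other, "->", separation_degree("asphalt", other))
```

Under these assumptions the function reproduces the three degrees quoted in the text (0, 1, and 2) and is symmetric in its arguments.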
To undertake the present study, we designed two sequential experiments as part of a comprehensive methodology to obtain surface-labeled, segmented pathways from georeferenced terrestrial imagery. The first experiment, “Free Prompting and Random Imagery Evaluation”, focuses on pathway segmentation. Here, we aimed to test the feasibility of open-vocabulary prompting across a diverse set of 296 randomly selected images, covering multiple categories and prompts to observe general trends and the model’s responses in a realistic, large-scale processing scenario. Excessively blurry images were discarded from the sample to ensure reliable segmentation results. The employed algorithm was a combination of Grounding Dino [33] and the foundation model SAM (Segment Anything) [43]. Grounding Dino provided output bounding boxes based on the prompts, which were then used as inputs for SAM to obtain the segmented features. We used Mapillary, an ever-growing platform with billions of community-contributed CC BY-SA licensed street-view images [44], as the data source.
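The chaining of a prompt-driven detector with a box-prompted segmenter can be sketched as follows. This is only a structural illustration: `detect` and `segment` are hypothetical callables standing in for Grounding Dino and SAM, the `Detection` and `SegmentedFeature` containers are invented for the example, and the `min_score` threshold is an assumed value, none of which are taken from the original implementation.

```python
from dataclasses import dataclass
from typing import Any, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Detection:
    prompt: str   # free-text prompt that produced the box
    box: Box
    score: float  # detector confidence


@dataclass
class SegmentedFeature:
    prompt: str
    box: Box
    score: float
    mask: Any     # whatever mask object the segmenter returns


def segment_by_prompt(image, prompts, detect, segment, min_score=0.3):
    """Run the detector once per prompt and segment every retained box.

    `detect(image, prompt)` must return a list of Detection objects and
    `segment(image, box)` must return a mask; both are placeholders for
    the real models. `min_score` is an assumed confidence threshold.
    """
    features = []
    for prompt in prompts:
        for det in detect(image, prompt):
            if det.score < min_score:
                continue  # skip low-confidence boxes before segmentation
            mask = segment(image, det.box)
            features.append(SegmentedFeature(det.prompt, det.box, det.score, mask))
    return features
```

Passing the models in as callables keeps the sketch independent of any particular library version and makes the two stages easy to replace or to stub out in tests.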
The second experiment, “Evaluation of Open-Vocabulary Algorithms for Standardized Pavement Classification”, addresses surface labeling and builds upon the segmentation results of the first experiment. Here, we tested prompts for standardized pavement classes against images of surface patches to assess the classification ability of an open-vocabulary algorithm in a constrained setup. We employed the CLIP algorithm, which assigns a probability to each entry of a set of sentences that can be changed at each query; this is a fundamental advantage, as it allows adaptation to different realities. We tested eight variations of the algorithm, each employing different datasets and training strategies, which are compared in Table 1 regarding their origin and computational costs. The data source for this experiment was the “deep-pavements-dataset”, a tailored dataset of surface patches collaboratively maintained on GitHub [45], containing 500 samples for OSM-compliant pavement categories at the time of the experiments. The testing included all categories in the ontology, with examples shown in Figure 3. Together, these experiments establish a pipeline in which Experiment 1 segments pathways and Experiment 2 assigns surface labels, demonstrating a workflow for large-scale pathway surface classification using georeferenced images.
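How CLIP turns a changeable set of sentences into per-class probabilities can be sketched without any model weights: the image and each sentence are embedded in a shared space, cosine similarities are scaled, and a softmax is applied. The embeddings below are toy vectors invented for illustration, and the `logit_scale` of 100 is an assumption based on the value commonly found in released CLIP checkpoints; a real run would obtain both from the model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probabilities(image_embedding, text_embeddings, logit_scale=100.0):
    """Map each candidate sentence to a probability for one image.

    `text_embeddings` is a dict from sentence (or class name) to its
    embedding; it can be rebuilt with different sentences at every query.
    """
    labels = list(text_embeddings)
    logits = [
        logit_scale * cosine_similarity(image_embedding, text_embeddings[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))


# Toy embeddings (hypothetical values, three dimensions for readability).
toy_text = {
    "asphalt": [1.0, 0.0, 0.0],
    "concrete": [0.8, 0.6, 0.0],
    "sett": [0.0, 0.0, 1.0],
}
toy_image = [0.9, 0.1, 0.1]
probs = zero_shot_probabilities(toy_image, toy_text)
print(max(probs, key=probs.get))
```

Because the class set is just the keys of a dictionary, swapping in synonyms or a different standard of surface classes requires no retraining, which is the adaptability the paragraph above refers to.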
It is valuable to elaborate on the reasons for selecting this set of algorithms for our study. SAM is particularly noted for its unparalleled segmentation capabilities, supported by a billion-level sample size, which enhances its applicability to real-world scenarios and provides superior generalization performance [46, 47]. Grounding Dino offers complementary strengths and significantly outperforms comparable algorithms. This advantage is attributed to its use of a transformer architecture that integrates multi-level text information [33, 48]. CLIP marked a significant breakthrough in the vision-language domain by employing a shared embedding space for text and images created through a contrastive learning approach [49]. It has been recognised for its robustness across various scenarios [50, 51, 52]. These algorithms are considered zero-shot learners, capable of performing tasks without being specifically optimized for them [11]. Furthermore, despite their relatively recent introduction, these models have already seen widespread use in industry for various applications, such as automated image data annotation, image search engines, accessibility tools providing image descriptions, and content recommendation systems [53].
Regarding the first experiment and the proposed ontology, we tested the following categories with their corresponding prompts:
Auxiliary: These entities are detected to determine if a pavement detection failure occurred due to occlusion rather than incorrect detection. Prompts are “car”, “vehicle”, “pole”, and “tree”.
Sidepaths: This query is primarily aimed at detecting sidewalks, but it may also retrieve other sidepaths, such as paved shoulders. Prompts are “sidewalk”, “sidepath”, “sideway”, “sidetrack”, and “lateral”.
Roads: Focused on detecting motorized pathways. Prompts are “road” and “street”.
Pathways: These are intended to detect any kind of traversable way. Prompts are “way”, “path”, “pathway”, “pavement”, and “track”.
Surface pavement types: Prompts directly target the property, with additional ones included for broader testing. Prompts are “sett”, “grass”, “cobblestone”, “earth”, “soil”, “dirt”, “sand”, “concrete”, “paving stones”, “chip seal”, “gravel”, “compacted”, “asphalt”, “concrete plates”, and “ground”.
Abstract concepts: Words that do not have a unique or physical representation are used to test the model’s responses. Prompts are “anything”, “nothing”, “something”, “void”, and “thing”.
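The prompt categories listed above can be collected into a simple lookup structure. The sketch below is illustrative only: the dictionary keys and the `possibly_occluded` helper are hypothetical names, and the occlusion rule (an auxiliary object was detected but no sidepath, road, or pathway was) is one assumed reading of how the auxiliary category could be used, not the authors’ documented procedure.

```python
# Prompts from the first experiment, grouped by category (keys are hypothetical).
PROMPTS_BY_CATEGORY = {
    "auxiliary": ["car", "vehicle", "pole", "tree"],
    "sidepaths": ["sidewalk", "sidepath", "sideway", "sidetrack", "lateral"],
    "roads": ["road", "street"],
    "pathways": ["way", "path", "pathway", "pavement", "track"],
    "surface": [
        "sett", "grass", "cobblestone", "earth", "soil", "dirt", "sand",
        "concrete", "paving stones", "chip seal", "gravel", "compacted",
        "asphalt", "concrete plates", "ground",
    ],
    "abstract": ["anything", "nothing", "something", "void", "thing"],
}

# Reverse lookup: which category does a given prompt belong to?
CATEGORY_OF_PROMPT = {
    prompt: category
    for category, prompts in PROMPTS_BY_CATEGORY.items()
    for prompt in prompts
}

PATH_CATEGORIES = {"sidepaths", "roads", "pathways"}


def possibly_occluded(detected_prompts):
    """Flag an image where only auxiliary objects were found (assumed rule)."""
    categories = {CATEGORY_OF_PROMPT[p] for p in detected_prompts}
    found_path = bool(categories & PATH_CATEGORIES)
    return (not found_path) and ("auxiliary" in categories)


print(len(CATEGORY_OF_PROMPT), "prompts in total")
print(possibly_occluded(["car", "tree"]))
```

Keeping the prompts in one structure makes it straightforward to iterate over every prompt per image and to summarize results per category afterwards.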
For the second experiment, we conducted two tests with different strategies to compare their classification accuracy with that of the pre-trained models. In the first test, “Model Aggregation”, we summed the class probabilities across all models to create an overall score per class. The class with the highest aggregated score was selected as the final prediction. This aggregation was first performed using all models and then repeated with only the three best overall performers.
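The aggregation rule described above, summing per-class probabilities over models and taking the highest total, can be sketched in a few lines. The model names and probability values are invented for illustration; only the sum-then-argmax rule comes from the text.

```python
def aggregate_predictions(per_model_probs):
    """Sum class probabilities across models and return the top class.

    `per_model_probs` maps a model name to a dict of class -> probability.
    Returns the winning class and the aggregated scores.
    """
    totals = {}
    for probs in per_model_probs.values():
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    best = max(totals, key=totals.get)
    return best, totals


# Hypothetical outputs of three models for one surface patch.
example = {
    "model_a": {"asphalt": 0.60, "concrete": 0.40},
    "model_b": {"asphalt": 0.30, "concrete": 0.70},
    "model_c": {"asphalt": 0.55, "concrete": 0.45},
}
best, totals = aggregate_predictions(example)
print(best, totals)
```

In this toy case two of the three models individually prefer “asphalt”, yet the summed score favours “concrete”, which shows that summing probabilities weighs each model’s confidence and is not equivalent to majority voting. Repeating the test with the three best performers amounts to calling the same function on a dictionary restricted to those models.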
In the second test, termed “Fine-Tuning”, we selected 60% of the samples for fine-tuning and used the remaining 40% (200 samples per category) for testing. Following the specifications in Table 1, we chose the lightest model to evaluate the extent of improvement relative to the model size and computational burden. Previous studies [49] showed that CLIP fine-tuning is highly sensitive to the optimizer’s hyperparameters. Therefore, we empirically tested multiple scenarios, presenting the worst and the best outcomes side by side. All the produced models and analyses are published in a repository in the HuggingFace community [54] to ensure reproducibility.
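A per-category 60/40 split of this kind can be sketched as below. The random shuffling, the fixed seed, and the file names are assumptions made for the example, since the text does not state how the samples were chosen; the category size of 500 is likewise assumed, being consistent with 200 test samples forming the remaining 40%.

```python
import random


def split_per_category(samples_by_category, train_fraction=0.6, seed=0):
    """Split each category independently into fine-tuning and test subsets.

    Shuffling with a fixed seed is an assumption made for reproducibility
    of this sketch; it is not taken from the original experiments.
    """
    rng = random.Random(seed)
    train, test = {}, {}
    for category, samples in samples_by_category.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train[category] = shuffled[:cut]
        test[category] = shuffled[cut:]
    return train, test


# Hypothetical file names: 500 patches for each of two categories.
dataset = {
    "asphalt": [f"asphalt_{i:03d}.png" for i in range(500)],
    "sett": [f"sett_{i:03d}.png" for i in range(500)],
}
train, test = split_per_category(dataset)
print(len(train["asphalt"]), len(test["asphalt"]))
```

Splitting within each category, rather than over the pooled dataset, keeps the class balance identical in the fine-tuning and test subsets.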
4. Conclusions
This study makes key contributions to geospatial analysis and urban infrastructure management. Firstly, we created the Deep Pavements dataset, a robust and expandable dataset specifically designed for training models in standardized (OSM-compatible) pavement material classification. This open dataset is a valuable resource for future research, enabling more comprehensive training and benchmarking of algorithms in this specialized area.
Secondly, we developed a fine-tuned, lightweight CLIP-based model optimized for pavement type detection. This model demonstrates how adapting large, state-of-the-art language models to specific, real-world applications can improve accuracy and efficiency. It also highlights the potential of using LLMs for tasks that require nuanced understanding and classification, offering a more intuitive approach to complex geospatial data challenges.
While this study provides significant insights, there are limitations that highlight opportunities for future research. The model’s performance could benefit from incorporating preprocessing techniques to handle low-quality or blurry images, which are common in real-world data. Automating such preprocessing steps would make the pipeline more robust and better suited to large-scale applications. Additionally, our study focused solely on terrestrial imagery. Future research could explore the integration of satellite or sensor data to improve model accuracy and add complementary perspectives, particularly for broader urban infrastructure analysis. Further investigation into optimal train–test split ratios could also provide additional performance insights.
Our findings emphasize the importance of diversifying training datasets to improve the performance of open-vocabulary models in specialized tasks, like pavement type detection. This result aligns with broader challenges in AI, where the reliance on widely used datasets may lead to skewed or limited results in less common applications. Addressing these biases by creating more diverse and representative datasets can significantly enhance model performance and reliability.
Future research should focus on enhancing the CLIP-based model by incorporating different or additional approaches, such as data augmentation and fine-tuning with heavier versions, to compare their performance on generic tasks. Exploring robust solutions that combine specialized CLIP models with constrained vocabularies to verify hallucinations independently could further enhance the reliability of these models. Achieving a truly generalist open-vocabulary algorithm for specialized tasks, like pavement type detection, is an ambitious goal. Still, it is essential for advancing the capabilities of AI in diverse and practical applications.
In conclusion, this study contributes to developing specialized AI models for urban planning and geospatial analysis and provides valuable insights into the broader implications of dataset biases and model generalization. These contributions lay the groundwork for future advancements in creating more flexible, accurate, and representative AI models for real-world applications.