Article

Intelligent Identification of Rural Productive Landscapes in Inner Mongolia

1 Architecture College, Inner Mongolia University of Technology, Hohhot 010051, China
2 Inner Mongolia Key Laboratory of Grassland Human Settlement System and Low-Carbon Construction Technology, Hohhot 010051, China
3 Art College, Inner Mongolia Normal University, Hohhot 010028, China
* Authors to whom correspondence should be addressed.
Computers 2025, 14(12), 565; https://doi.org/10.3390/computers14120565
Submission received: 27 October 2025 / Revised: 15 December 2025 / Accepted: 15 December 2025 / Published: 17 December 2025
(This article belongs to the Special Issue Machine Learning: Innovation, Implementation, and Impact)

Abstract

Productive landscapes are an important part of intangible cultural heritage, and their protection and inheritance are of great significance to the prosperity and sustainable development of national culture. They not only reflect the wisdom accumulated through the long-term interaction between human production activities and the natural environment, but also carry a strong symbolic meaning of rural culture. However, current research and investigation on productive landscapes still rely mainly on field surveys and manual records conducted by experts and scholars. This process is time-consuming and costly, making efficient, systematic analysis and comparison difficult, especially when dealing with large-scale and typologically diverse landscapes. To address this problem, this study takes the Inner Mongolia region as the main research area and builds a productive landscape feature data framework that reflects the diversity of rural production activities and cultural landscapes. The framework covers four major types of landscapes: agriculture, animal husbandry, fishery and hunting, and sideline production and processing. Based on artificial intelligence and deep learning technologies, this study conducts comparative experiments on several convolutional neural network models to evaluate their classification performance and adaptability in complex rural environments. The results show that the improved CEM-ResNet50 model performs better than the other models in terms of accuracy, stability, and feature recognition ability, demonstrating stronger generalization and robustness. Through a semantic clustering approach in image classification, the model’s recognition process is visually interpreted, revealing the clustering patterns and possible sources of confusion among different landscape elements in the semantic space.
This study reduces the time and economic cost of traditional field investigations and achieves efficient and intelligent recognition of rural productive landscapes. It also provides a new technical approach for the digital protection and cultural heritage transmission of productive landscapes, offering valuable references for future research in related fields.

1. Introduction

Rural productive landscapes serve as vital carriers of intangible cultural heritage and regional culture, embodying ethnic cultural memory, production knowledge, and daily lifestyles [1,2,3]. With rapid urbanisation and the persistent decline of rural populations, traditional production methods are gradually fading: many representative production spaces lie idle, undergo conversion, or are even demolished, while the physical environments that support production practices and historical memory continue to shrink [4,5]. Consequently, forms of intangible cultural heritage reliant on specific spatial environments face mounting pressures. Against the dual backdrop of rural revitalisation and cultural heritage preservation, systematically identifying and classifying rural productive landscapes holds significant practical value for understanding rural productive civilisation, strengthening local cultural identity, and supporting planning and landscape design practices [6,7].
Productive landscapes are typically understood as composite landscape units centred on agriculture, animal husbandry, fisheries and related processing activities, comprising elements such as arable land, pastures, water bodies and operational facilities [8,9]. They not only possess distinct productive functions but also concretely embody the lifestyles, customs and craft traditions of specific regions. Recent research has begun to conceptualise rural productive landscapes through semantic dimensions such as ‘productive environments,’ ‘productive tools,’ ‘productive techniques,’ and ‘productive outputs.’ This approach links visible spatial forms to underlying knowledge systems, institutional arrangements, and cultural meanings [10]. This semantic perspective helps unify scales of observation and analytical viewpoints, providing reusable conceptual tools for subsequent documentation, comparative analysis, and conservation decision-making.
Currently, research and work in the field of rural productive landscapes primarily focus on the spatial form of landscapes, such as the visual characteristics of traditional villages and vernacular architecture, with little attention paid to the dynamic semantic features of productive landscapes—namely the interplay between production environment, production tools, production techniques, and production outputs. Methodologically, traditional studies have predominantly employed qualitative approaches such as field surveys and in-depth interviews [11,12,13,14]. These studies have yielded significant insights into the concept, typology, and value of rural productive landscapes, while emphasising the role of local knowledge and productive wisdom in rural revitalisation, intangible cultural heritage preservation, and rural tourism development. Nevertheless, such research remains largely case-based, long reliant on expert judgement, and time-consuming due to its dependence on field surveys, manual mapping, and visual interpretation. Data collection is frequently affected by factors such as transport and climatic conditions, leading to substantially increased human and material costs. Technically, there remains a scarcity of efficient, unified frameworks for systematically cataloguing and quantitatively analysing large-scale rural productive landscapes. This limitation constrains the extension of research findings to broader spatial and temporal scales [15,16].
Concurrently, computer vision and deep learning have been rapidly applied to landscape, environmental, and cultural heritage research, enabling quantitative analysis of large image datasets. Existing image-based studies broadly fall into three categories. Firstly, land use and land cover classification utilising remote sensing or aerial imagery, where texture features or convolutional neural networks (CNNs) are employed to identify arable land, woodland, water bodies, and built-up areas, supporting land management and ecological assessments [17,18]. Secondly, ground photographs are employed to classify natural and urban landscapes, such as distinguishing mountains, grasslands, water bodies, and streetscapes. This technique has been applied to landscape preference studies, visual quality assessments, and tourism recommendations [19,20]. Thirdly, deep learning models automatically identify images of cultural heritage sites and historic districts to support inventory compilation, condition monitoring, and digital exhibition [21,22,23]. These advances demonstrate that deep learning can effectively extract high-level semantic features and support the automatic recognition of diverse landscape objects across different scales and categories, exhibiting broad application potential in rural areas.
Against this backdrop, to address the practical demands of rural revitalisation and cultural heritage preservation in Inner Mongolia, this study aims to integrate domain-specific semantic frameworks with deep learning-based image classification. Taking Inner Mongolia’s rural production landscapes as our research subject, we seek to transcend mere appearance-based distinctions by explicitly modelling the internal structures of production activities and their cultural connotations. Specifically, we constructed a semantic classification framework capturing key elements such as production environments, tools, techniques, and outputs. This framework served as the conceptual foundation for dataset construction and model design. By integrating field survey photographs with carefully curated web-sourced images, we developed a semantically annotated image dataset. Building upon the ResNet50 backbone architecture, we further designed a Cultural Embedding Module (CEM), constructing a CEM-ResNet50 model. This model incorporates cultural semantic cues through cross-channel attention mechanisms, multi-scale feature fusion, and semantic mapping. We compared this model against several baseline CNN architectures under a unified training scheme. Figure 1 summarises the overall research technical roadmap of this paper.
The principal contributions of this paper are as follows:
(1) A semantic classification framework for rural production landscapes has been constructed. We propose a semantic framework comprising 24 categories, encompassing four characteristic dimensions—production environment, production tools, production techniques, and production outputs—alongside four production sectors: agriculture, animal husbandry, fisheries, and by-product processing. This framework explicitly links tangible spatial elements with intangible cultural semantics, providing a reusable conceptual and annotation foundation for subsequent rural production landscape research.
(2) A semantically annotated image dataset of Inner Mongolia’s rural production landscapes has been constructed. Based on extensive field research and meticulously curated online resources, we have developed an image dataset of Inner Mongolia’s rural production landscapes with standardised annotation guidelines. Three trained experts with backgrounds in rural landscape studies annotated the images according to the proposed framework. Inter-rater reliability was assessed using Cohen’s kappa coefficient, revealing excellent consistency. This dataset provides foundational support for the fine-grained identification and quantitative analysis of productive landscapes.
(3) The CEM-ResNet50 model incorporating cultural semantic embeddings and its comparative evaluation. We designed and implemented a CEM-ResNet50 model, which introduces a cultural semantic embedding module within the ResNet50 backbone network. Under a unified training protocol, we compared this model against VGG16, ResNet18, and ResNet50 using Top-1 accuracy, macro-average and micro-average precision, recall, and F1 scores, alongside confusion matrices and t-SNE-based feature visualisations. Results demonstrate that CEM-ResNet50 enhances fine-grained recognition performance, particularly for visually similar or sparsely sampled categories. Moreover, the generated feature space aligns more closely with the proposed semantic framework, thereby showcasing its potential for digital documentation and analysis of rural production landscapes.
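The inter-rater reliability check mentioned in contribution (2) can be sketched in a few lines. The following is a minimal pure-Python version of Cohen's kappa for two annotators labelling the same images; it is a generic formula for illustration, not the authors' code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of images on which both annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same class at random.
    expected = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)  # undefined when expected == 1
```

A kappa close to 1 indicates near-perfect agreement, while 0 indicates agreement no better than chance.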
The remainder of this paper is organised as follows: Section 2 reviews prior work on rural productive landscapes and image classification within landscape, environmental, and heritage studies. Section 3 introduces the study area, elaborates on the semantic classification framework, and describes the construction of the rural productive landscape image dataset. Section 4 details the convolutional neural network models (including the proposed CEM-ResNet50) and experimental setup. Section 5 reports and analyses the experimental results, encompassing classification metrics and feature space visualisations. Section 6 summarises key findings and discusses the limitations of this research.

2. Related Work

2.1. Rural Productive Landscapes: Concepts and Traditional Studies

Rural productive landscapes, as key material carriers of agrarian civilisation and local memory, have been an important focus of rural geography, landscape studies, and rural planning since the late twentieth century. Existing research has mainly approached this topic from the perspectives of historical evolution, spatial patterns, and cultural values, and has employed methods such as literature review, field investigation, and interviews to examine the formation mechanisms and typological characteristics of productive landscapes in different regions, as well as their relationships with village spatial structures. This body of work emphasises the in-depth exploration of local knowledge and productive wisdom, and contributes to a better understanding of the integrated value of productive landscapes in the contexts of rural revitalisation, intangible cultural heritage conservation, and rural tourism development.
However, in terms of methodology, traditional studies rely heavily on on-site surveys, hand-drawn mapping, and expert interpretation. Such approaches are time-consuming and costly, and it is difficult to achieve efficient and systematic analysis and comparison when dealing with large areas and multiple landscape types. With the explosive growth of image data—particularly the widespread availability of fieldwork photographs and online images—an urgent research need has emerged: to develop ways of automatically and objectively identifying and quantifying elements of rural productive landscapes from large-scale visual data.

2.2. Image Classification in Landscape and Heritage Research

In recent years, rapid advances in computer vision and deep learning have led to the widespread application of image classification methods in landscape, environmental, and cultural heritage research. Existing studies can be broadly grouped into three categories. First, land-use and land-cover classification based on remote sensing or aerial imagery, in which traditional texture features or convolutional neural networks (CNNs) are used to identify arable land, woodland, water bodies, built-up areas and related classes, thereby supporting land management and ecological assessment. Second, classification of natural and urban landscapes from ground-level photographs, such as distinguishing mountain, grassland, water and urban street scenes, which has been widely applied to landscape preference analysis, visual ecological assessment and tourism recommendation. Third, image recognition studies targeting cultural heritage and historic districts, where deep learning models are employed to automatically identify traditional buildings, heritage landmarks or the spatial patterns of historic quarters, providing technical support for heritage monitoring and digital presentation.
Table 1 summarises and compares representative image classification studies in landscape and environmental research in recent years. It can be observed that most studies adopt classical convolutional neural networks such as VGG, ResNet, DenseNet and EfficientNet as backbone models, while some further incorporate attention mechanisms, multi-scale feature fusion or multimodal information to enhance recognition accuracy in complex scenes. These developments indicate that deep learning has clear advantages in automatically extracting high-level semantic features from landscape images and can effectively support the automatic recognition and classification of landscape objects across different scales and categories.

2.3. Research Gaps and Objectives

Research on rural productive landscapes has accumulated substantial achievements in terms of conceptual definition, typological classification, and value analysis; however, it remains relatively weak with respect to the automatic identification and quantitative analysis of large-scale visual data. Although image classification and deep learning methods have been widely applied in landscape and environmental studies and have demonstrated clear advantages in multi-scale feature extraction and scene recognition, existing research objects are predominantly natural landscapes, urban spaces, or land-use categories, and their labelling systems are largely constructed around general spatial functions and environmental classes. For rural productive landscapes—complex objects that combine production functions with rich cultural connotations—there is still a lack of dedicated semantic frameworks and supporting datasets.
Building on the above literature review, this study mainly addresses the following two research gaps:
(1) Lack of productive landscape image datasets with standardised annotations. The label systems of currently available public datasets are mostly constructed around natural, artificial, or land-use categories and cannot directly reflect productive processes and cultural semantics, which constrains the application and evaluation of deep learning methods in this field.
(2) Insufficient integration between deep learning models and production-related semantic structures. Existing landscape image classification models generally treat landscapes as generic scene categories; their internal feature representations are difficult to relate back to the domain-knowledge framework of “productive activities–spatial environment–material outputs”, and thus are not well suited to feeding recognition results into planning, design, and cultural interpretation practices.
To address the above gaps, this study takes rural productive landscapes in Inner Mongolia as its research object and pursues the following objectives:
(1) to develop a semantic classification framework for productive landscapes that simultaneously accounts for tangible spatial elements and intangible cultural semantics.
(2) to integrate fieldwork photographs with rigorously screened online images and establish a productive landscape image dataset with a unified annotation protocol.
(3) to introduce a cultural semantic embedding module into a ResNet50 backbone to construct the CEM-ResNet50 model, to compare it with multiple CNN architectures under a unified training protocol, and—through multi-class evaluation metrics and feature-space visualisation—to examine the applicability of this approach to the recognition of rural productive landscapes.

3. Materials and Methods

This section introduces the study area as well as the sources and acquisition methods of the image data, elaborates the semantic classification framework of rural productive landscape features, and further describes the procedures for image annotation, quality control, and data partitioning. The overall methodological workflow of this study—from data acquisition, semantic framework and dataset construction, through convolutional neural network training, to result analysis and visualisation—is illustrated in Figure 2.

3.1. Study Area

Supported by the General Project of Humanities and Social Sciences of the Ministry of Education, “Investigation of Productive Landscape Heritage in Beautiful Villages of Inner Mongolia”, the research team has carried out a systematic survey of rural productive landscapes in selected banners, counties, and surrounding areas of Inner Mongolia since 2024. The study area covers typical agro-pastoral ecotones such as the Hetao Plain and the Hulunbuir Grassland, as well as representative traditional settlements and industrial spaces, thereby providing a relatively comprehensive reflection of the landscape characteristics associated with local agricultural, pastoral, fishery, and by-product processing activities.
From May to December 2024, the research team carried out several rounds of fieldwork in selected banners, counties, and surrounding areas of Inner Mongolia. Drawing on the organisational experience of previous “Beautiful Countryside” surveys, and informed by literature review and preliminary reconnaissance, representative productive landscape sites were systematically documented. In total, the present survey covers approximately 450 rural productive landscape locations, whose spatial distribution is shown in Figure 3. The documentation encompasses multiple aspects, including rural settlement patterns, productive spaces, cultural landscapes, and ecological features.

3.2. Image Data Sources and Acquisition

The rural productive landscape image dataset used in this study integrates two main sources—field survey photographs and publicly available online images—in order to balance on-site authenticity with sample diversity.
(1) Field survey images
These images were collected by the research team at the aforementioned 450 productive landscape locations and include both ground-level photographs and aerial images captured by unmanned aerial vehicles (UAVs). The shooting equipment comprised digital cameras, UAVs, and smartphones, resulting in certain differences in resolution, aspect ratio, and lighting conditions. For each field image, the shooting location, time, type of productive activity, and a brief descriptive note were recorded on site to support subsequent annotation and cross-checking.
(2) Publicly available online images
To complement typical scenes that were difficult to reproduce on site or for which only a limited number of field photographs could be obtained, this study, under copyright-compliant conditions, collected images related to rural productive landscapes in Inner Mongolia from official or semi-official online platforms. These platforms include national and local government portals; websites of cultural institutions such as culture and tourism departments and museums; special feature pages of mainstream news media; and a number of image libraries and promotional materials with explicit authorisation statements.
During the screening of online images, candidate samples were first retrieved by combining place names with keywords such as “agricultural landscape”, “pastoral landscape”, “fishery landscape”, and “traditional processing”. The shooting locations, event types, and productive activities were then cross-checked using page titles, image captions, and the main text, and images that were unrelated to rural productive landscapes or semantically ambiguous were removed. Images with unclear sources, suspected screenshots from personal social media, or with obvious copyright watermarks or restrictions were excluded from the dataset. For all retained images, the source URL and access date were recorded to facilitate subsequent traceability.
In total, more than 20,000 image samples were obtained through field surveys and online collection. After multiple rounds of manual screening and quality control, 7344 images with clear composition, salient themes, and good representation of the characteristics of rural productive landscapes in Inner Mongolia were finally selected as research samples. Due to differences in shooting perspectives, compositional conventions, and compression quality, field photographs and online images may exhibit distributional shifts. This study seeks to minimise such effects during image preprocessing and model training, and explicitly discusses them as data-related limitations in Section 5.

3.3. Semantic Classification Framework for Rural Productive Landscapes

In the field of international cultural heritage conservation, the concept of “cultural landscape” has been employed to emphasise integrated landscape types jointly shaped by the natural environment and long-term human productive and everyday practices. In 1992, the Operational Guidelines for the Implementation of the World Heritage Convention were revised on the basis of the definition of cultural heritage in Article 1 of the 1972 World Heritage Convention, and “cultural landscapes” were formally established as an independent category on the World Heritage List, thereby highlighting the importance of the relationship between productive practices and the environment in heritage designation [34]. Building on this, the Agency for Cultural Affairs in Japan developed the protection system of “Cultural Landscapes” and “Important Cultural Landscapes”, incorporating landscapes that reflect the interrelationships among local livelihoods, land use, and regional environments into the cultural properties system [35]. Through the refinement of typological categories and selection criteria, this system has accumulated substantial experience in the survey and conservation of everyday productive landscapes.
Against this background, the systematic identification and automated analysis of rural productive landscapes at a regional scale first require the establishment of a clear semantic classification framework of characteristic features. On the one hand, such a framework provides unified descriptive standards and category systems for different types of productive scenes, material elements, and cultural representations, enabling rural productive landscapes to be accurately identified and classified within a shared conceptual structure and thereby laying a data foundation for subsequent quantitative analyses and comparative studies [36,37]. On the other hand, it offers a label system that can be directly utilised by deep learning-based automatic image recognition methods, allowing model outputs to be interpreted in relation to core characteristics such as production environment, production tools, production techniques, and production outputs. In this way, it provides reusable foundational support for research and conservation of rural productive landscapes, as well as for related fields such as environmental design and landscape planning.
The framework was constructed as follows. First, drawing on the literature on cultural landscapes and important cultural landscapes, together with records from preliminary field surveys, rural productive landscapes in Inner Mongolia were analytically deconstructed, and their main characteristics were summarised into four semantic dimensions: production environment, production tools, production techniques, and production outputs. These dimensions are used to describe, respectively, the spatial settings on which productive activities depend, the material facilities employed, the operational processes that embody regional characteristics, and the material outcomes that can be visually observed. Second, in light of the actual industrial structure of the study area, productive activities were divided into four sectors: agriculture, animal husbandry, fisheries, and by-product processing. Finally, the “four semantic dimensions” were cross-combined with the “four production sectors” to obtain 24 fine-grained categories; based on multiple rounds of trial annotation and discussion, the category names and connotations were further refined so that they can robustly cover the main types of productive landscapes observed during the surveys.
On this basis, the semantic classification framework of rural productive landscape features developed in this study divides all samples into four primary categories: Production Environment, Production Tools, Production Techniques, and Production Outputs. Within each primary category, samples are further subdivided by production sector into four subcategories—agriculture, animal husbandry, fisheries, and by-product processing—resulting in 24 fine-grained classes in total. The overall structure and label configuration are shown in Figure 4, and the specific names and sample counts of each category are listed in Table 2. All subsequent convolutional neural network training and evaluation in this study are conducted under this labelling scheme.

4. Convolutional Neural Network Models and Experimental Settings

4.1. Baseline Convolutional Neural Network Models

To evaluate the applicability of the proposed semantic classification framework of rural productive landscapes in image recognition tasks, and to assess performance changes after introducing cultural semantic embeddings, we conducted comparative experiments under a unified hardware and software environment. All experiments were run on a GeForce RTX 4090 GPU (24 GB VRAM), using PyTorch (version 2.7.1) as the main deep learning framework. The training hyperparameters were set as follows: batch size = 32, initial learning rate = 0.001, number of epochs = 120, Adam optimizer, and categorical cross-entropy loss.
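For reference, the hyperparameters above can be collected into a short PyTorch sketch; the `build_optimisation` helper is illustrative rather than the authors' code, and the dataloader and training loop are omitted.

```python
import torch
import torch.nn as nn

BATCH_SIZE = 32   # batch size used for all models
EPOCHS = 120      # number of training epochs

def build_optimisation(model: nn.Module):
    """Optimiser and loss matching the hyperparameters stated above."""
    criterion = nn.CrossEntropyLoss()                           # categorical cross-entropy
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # initial learning rate
    return criterion, optimizer
```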
For model selection, VGG16, ResNet18, and ResNet50 were adopted as baseline convolutional neural networks. These architectures are representative in the field of image classification and cover a spectrum from relatively shallow networks to deeper residual networks, which facilitates a horizontal comparison of model performance under the same dataset and training protocol.
All baseline models were initialized with ImageNet-pretrained weights and adapted to the 24-class fine-grained classification task of rural productive landscapes via transfer learning. Specifically, the original final fully connected layer was replaced with a classification head with an output dimension of 24, while the remaining convolutional layers retained their pretrained parameters. During the initial training phase, a relatively small learning rate was used to jointly fine-tune the entire network, balancing convergence speed and overfitting control under limited sample conditions.

4.2. CEM-ResNet50 Model

Building on the ResNet50 baseline network, this study introduces a Cultural Embedding Module (CEM) to construct the CEM-ResNet50 model, with the aim of enhancing the network’s capacity to represent cultural features in rural productive landscapes. The CEM consists of three sequentially connected components: a Cross-channel Cultural Attention module (CCA), a Multi-scale Cultural Feature fusion module (MSCF), and a Cultural Semantic Mapping Layer (CSML). The output of the CEM is fused with the backbone features in a residual manner. The overall architecture is illustrated in Figure 5.
The CEM-ResNet50 model adopts the classical ResNet50 as its backbone network and embeds a Cultural Embedding Module (CEM) on this basis, forming a deep neural architecture tailored to the classification of traditional food-related productive landscape images [38]. As the base framework, ResNet50, with its 50-layer depth and residual connections, effectively alleviates the vanishing-gradient problem in deep network training and enables the progressive extraction of features from raw image pixels to abstract semantic representations. It consists of an initial convolutional layer, four stages of stacked residual blocks, and a final classification layer. The initial convolutional layer performs preliminary feature extraction and downsampling of the input images; the four residual stages successively capture multi-level features ranging from low-level textures to high-level semantic patterns; and the classification layer maps the extracted features to specific category labels.
However, when dealing with traditional food productive landscape images, the standard ResNet50 remains limited in its ability to mine the embedded cultural semantic information. It struggles to accurately capture the subtle cultural differences in colour, texture, and form among categories such as “white foods”, “red foods”, and production tools, which are crucial for fine-grained discrimination in this domain.
The core innovation of the CEM-ResNet50 model lies in the introduction of the CEM, which comprises three main components: Cross-channel Cultural Attention (CCA), Multi-scale Cultural Feature fusion (MSCF), and the Cultural Semantic Mapping Layer (CSML). Together, these components enhance the model’s ability to perceive and represent culturally salient features from different perspectives. The CCA module, informed by domain-specific cultural priors, dynamically adjusts channel-wise feature weights according to the typical visual characteristics of different categories in traditional food-related scenes. Specifically, when processing images belonging to “white food” categories, the CCA module automatically increases the weights of the G/B channels, enabling the model to sensitively capture visual elements closely associated with such foods, such as the milky sheen of butter or the delicate matte surface of fermented dairy products (e.g., naidoufu). In contrast, when handling “red food” images, the module emphasises the R channel, thereby highlighting characteristic traces of roasting or drying techniques, such as the reddish-brown charred patches on roasted whole lamb or the dark-red fibres on dried meat. By recalibrating channel importance in this manner, the model effectively suppresses irrelevant information and focuses on culturally distinctive visual cues, which in turn provides strong support for accurate classification.
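The channel recalibration described above can be sketched as a squeeze-and-excitation-style attention block. This is a schematic interpretation: the reduction ratio and the absence of an explicit colour-prior initialisation are our assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class CrossChannelCulturalAttention(nn.Module):
    """SE-style channel attention sketch in the spirit of the CCA module."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global context per channel
        self.fc = nn.Sequential(              # excitation: learn channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # recalibrated feature map
```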
The MSCF module is designed to address the problem of multi-scale feature variation in traditional food-related productive landscape images. It applies convolution kernels of different sizes (e.g., 1 × 1, 3 × 3, and 5 × 5) to the input features, thereby capturing information ranging from subtle local differences to global morphological patterns. Small-scale kernels focus on extracting fine-grained discriminative features, such as precisely distinguishing tiny specular highlights formed by surface fat on butter from the delicate cut-induced textures on fermented dairy products. Large-scale kernels, by contrast, are responsible for capturing global shape cues, such as outlining the overall contour and grid structure of roasting racks, or the bodily form of roasted whole lambs. Subsequently, these multi-scale feature maps are fused through a feature-pyramid-like structure, enabling the model to jointly account for both detail and overall configuration and to form a more comprehensive and discriminative representation of the image. In practical classification, when confronted with complex food-related productive landscape scenes, the model can make integrated use of these multi-scale features, thus avoiding misclassification due to insufficient information at a single scale and significantly improving both the accuracy and reliability of the predictions.
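The parallel-kernel fusion described for MSCF can be sketched as three convolution branches (1 × 1, 3 × 3, 5 × 5) whose outputs are concatenated and projected back. The branch widths and the 1 × 1 fusion layer are assumptions for illustration, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class MultiScaleCulturalFusion(nn.Module):
    """Illustrative sketch of the MSCF idea: small kernels capture fine
    local detail, larger kernels capture global shape, and the branches
    are fused back to the original channel width."""
    def __init__(self, channels: int):
        super().__init__()
        branch = channels // 4
        self.b1 = nn.Conv2d(channels, branch, kernel_size=1)
        self.b3 = nn.Conv2d(channels, branch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(channels, branch, kernel_size=5, padding=2)
        # 1x1 fusion projects the concatenated branches back to `channels`.
        self.fuse = nn.Conv2d(3 * branch, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        multi = torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)
        return self.fuse(multi)

mscf = MultiScaleCulturalFusion(256)
out = mscf(torch.randn(1, 256, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28])
```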
The role of the CSML module is to establish a linkage between visual features and cultural semantics, so that the model can implicitly learn the underlying cultural associations of each category even in the absence of explicit semantic annotations. Concretely, this module employs two fully connected layers to project the feature vectors—after processing by ResNet50, CCA, and MSCF—into a cultural semantic space. Within this space, the model can capture latent relationships between visual categories and culturally meaningful concepts, such as the association of “white foods” with dairy products and fermentation processes, “red foods” with roasting and meat-based dishes, and production tools with functional form and material properties.
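The two-fully-connected-layer projection described for CSML can be sketched as follows. The 512-dimensional semantic space and the layer widths are assumptions; only the structure (two FC layers mapping pooled features into a semantic embedding, followed by classification) comes from the text.

```python
import torch
import torch.nn as nn

class CulturalSemanticMapping(nn.Module):
    """Sketch of the CSML idea: two fully connected layers project pooled
    visual features into a cultural semantic space; the classifier then
    operates on that embedding."""
    def __init__(self, in_features: int = 2048, semantic_dim: int = 512,
                 num_classes: int = 24):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(in_features, semantic_dim),
            nn.ReLU(inplace=True),
            nn.Linear(semantic_dim, semantic_dim),  # cultural semantic space
        )
        self.classify = nn.Linear(semantic_dim, num_classes)

    def forward(self, x: torch.Tensor):
        semantic = self.project(x)   # latent visual-cultural embedding
        return self.classify(semantic), semantic

csml = CulturalSemanticMapping()
logits, emb = csml(torch.randn(4, 2048))
print(logits.shape, emb.shape)  # torch.Size([4, 24]) torch.Size([4, 512])
```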
During classification, the model thus relies not only on visual features but also on these learned semantic associations, making its decisions more consistent with human cognitive patterns regarding traditional food culture. By analysing the decision process with interpretability tools such as LIME (Local Interpretable Model-agnostic Explanations), one can clearly observe how the model activates corresponding cultural semantic tags based on specific visual cues and then arrives at a final category prediction. This procedure endows the model with good interpretability and highlights the distinctive advantage of CEM-ResNet50 in understanding and representing cultural features.
In the overall workflow, input images of traditional food-related productive landscapes are first passed through the ResNet50 backbone to obtain multi-level feature maps. These feature maps are then processed by the CCA module, which reweights channels to emphasise culturally salient cues, followed by the MSCF module, which performs multi-scale feature fusion to form more representative feature embeddings. Finally, the CSML module establishes the mapping to cultural semantics, and the resulting features are fed into the classification layer to produce the final predictions. In this way, CEM-ResNet50 effectively compensates for the limitations of plain ResNet50 in cultural scene applications, exhibiting stronger feature extraction capabilities and higher classification accuracy in the task of traditional food productive landscape image classification.

4.3. Experimental Fairness

To ensure a fair comparison among different models, this study adopts exactly the same training protocol and data partitioning strategy for VGG16, ResNet18, ResNet50, and CEM-ResNet50; the only difference lies in the network architectures themselves. All models are trained and evaluated using identical training/validation/test splits, the same preprocessing pipeline, and the same set of hyperparameters, and their performance is compared on this common basis. Specifically:
(1) Image standardisation: All images are resized to 224 × 224 pixels, pixel values are normalised to the [0, 1] range, and colour channels are converted to the RGB space, so as to mitigate the influence of different shooting devices and illumination conditions.
(2) Data augmentation: The same data augmentation strategy (e.g., random rotation, scaling, horizontal flipping) is applied to all models and all samples. This not only enhances model robustness, but also helps align the feature distributions of images from different sources.
(3) Source-balance checking: During dataset construction, the proportions of field-survey images and web-sourced images are examined and controlled within each category, with the aim of maintaining a relatively balanced source composition. This reduces the risk that samples from a particular source become overly concentrated in specific classes, and thus minimises source bias in model evaluation.

4.4. Experimental Design

In line with the research objectives, the following experiment is designed:
Baseline model comparison: Under identical data partitions and training protocols, VGG16, ResNet18, ResNet50, and CEM-ResNet50 are trained separately. Their performance on the test set is then compared in terms of Top-1 accuracy as well as macro- and micro-averaged precision, recall, and F1-score. This allows us to evaluate the impact of incorporating cultural semantic embeddings on the overall classification performance.

4.5. Multi-Class Evaluation Metrics

Top-1 Accuracy: The proportion of samples in the test set for which the predicted label exactly matches the ground-truth label. This metric reflects the overall correctness of the model’s classifications.
Macro-averaged Precision/Recall/F1: Precision, recall, and F1-score are first computed separately for each of the 24 classes and then averaged arithmetically across all classes. These macro-averaged metrics capture the model’s performance balance among different categories and are more sensitive to classes with relatively few samples.
Micro-averaged Precision/Recall/F1: True positives (TPs), false positives (FPs), and false negatives (FNs) are accumulated over all classes, and precision, recall, and F1-score are then computed based on these global counts. These micro-averaged metrics emphasise average performance at the sample level and are more strongly influenced by classes with larger sample sizes.
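The three metric families above can be computed directly with scikit-learn. The toy labels below use 3 classes instead of 24 purely for illustration; note that for single-label multi-class data the micro-averaged metrics coincide with Top-1 accuracy.

```python
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Toy ground truth and predictions for a 3-class illustration.
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 2, 0]

top1 = accuracy_score(y_true, y_pred)  # fraction of exact matches

# Macro: per-class precision/recall/F1, averaged with equal weight per class,
# so small classes count as much as large ones.
p_mac, r_mac, f_mac, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# Micro: TPs/FPs/FNs pooled over all classes before computing the metrics,
# so large classes dominate.
p_mic, r_mic, f_mic, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0)

print(round(top1, 2), round(f_mac, 2), round(f_mic, 2))  # 0.7 0.66 0.7
```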

4.6. Feature-Space Visualisation (t-SNE)

To explore the relationship between the high-level features learned by the models and the semantic structure of rural productive landscapes, this study employs t-distributed Stochastic Neighbour Embedding (t-SNE) to perform dimensionality reduction and visualisation on the features from the penultimate layer of the convolutional networks. This layer is chosen for visualisation for two main reasons:
(1) it is typically the output of a global average pooling or fully connected layer and directly participates in the final classification decision, thus reflecting the models’ integrated representations of different categories.
(2) its feature dimensionality is moderate, containing rich semantic information while still being suitable for two-dimensional visualisation.
Concretely, a subset of samples is randomly selected from the test set, and their high-dimensional feature vectors are extracted from the trained VGG16, ResNet50, and CEM-ResNet50 models. These feature vectors are then mapped onto a two-dimensional plane using t-SNE, with data points coloured according to their ground-truth categories. By comparing the clustering and separation patterns of different models in the resulting feature spaces, we can intuitively assess the extent to which the CEM enhances the semantic structure of the learned representations, thereby complementing and constraining the interpretation of numerical evaluation results.
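The projection step can be sketched as follows. Synthetic vectors stand in for the penultimate-layer features (in the study these would be extracted from the trained networks); the sample count, feature dimension, and perplexity are illustrative choices.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for penultimate-layer features: 60 samples from 3 synthetic
# "classes", each with 128-dim features.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(loc=c * 4.0, scale=1.0, size=(20, 128))
                   for c in range(3)])
labels = np.repeat(np.arange(3), 20)  # ground-truth class per sample

# Map the high-dimensional features onto a 2-D plane; perplexity must be
# smaller than the number of samples.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(feats)
print(emb.shape)  # (60, 2)
```

The 2-D points would then be scattered with one colour per entry of `labels` (e.g. via matplotlib) to inspect cluster compactness and separation.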

5. Results and Discussion

5.1. Overall Performance Comparison of Convolutional Neural Networks

To identify a suitable convolutional neural network architecture for productive landscape image classification, this study conducted comparative experiments on four representative CNN models: VGG16, ResNet18, ResNet50, and the improved CEM-ResNet50. To ensure fairness and reproducibility, all models were trained and tested under identical hardware and software conditions, with unified settings for input image size, learning rate, optimisation strategy, and other key parameters. Their performance differences in recognising complex landscape images were evaluated by comparing multiple classification metrics.
As shown in Table 3, the four models exhibit noticeable differences in loss, accuracy, precision, recall, and F1-score. ResNet50 achieves relatively stable overall performance, with an accuracy of 0.8612 and an F1-score of 0.8598, indicating strong feature extraction capability. The performance of VGG16 is slightly lower, with an accuracy of 0.8479 and an F1-score of 0.8154, suggesting a certain tendency toward overfitting under limited sample conditions, partly due to its large number of parameters. Among all models, the proposed CEM-ResNet50 achieves the best results across all metrics: the accuracy reaches 0.8847, precision and recall are 0.8912 and 0.8785, respectively, the F1-score is 0.8843, and the loss is the lowest (0.3756), indicating superior overall performance compared with the other architectures.
The training curves further show that CEM-ResNet50 converges faster during training, maintains stable performance on the validation set, and exhibits smaller fluctuations, which together suggest stronger generalisation ability in feature learning and classification. Benefiting from the improved feature fusion and inter-layer information optimisation mechanisms, the model is better able to capture semantic details in the images and achieves higher accuracy when distinguishing visually similar landscape types. Considering loss, accuracy, F1-score, and other metrics in combination, CEM-ResNet50 can be regarded as the best-performing and most suitable deep convolutional neural network model for the task addressed in this study.
A further examination of the training curves shows that ResNet50 consistently outperforms VGG16 and ResNet18 on the test set, while CEM-ResNet50 achieves an additional performance gain on this basis. As illustrated in Figure 6 and Figure 7, the accuracy curve of CEM-ResNet50 is smoother, converges more rapidly, and reaches a higher final level. Its loss curve remains below those of the other models throughout the entire training process, indicating better stability during optimisation. Taken together, these results demonstrate that CEM-ResNet50 delivers the best performance for the classification of traditional food-related productive landscapes in central Inner Mongolia and is therefore suitable to be adopted as the primary model in this study.

5.2. Training Results Based on the CEM-ResNet50 Model

From the overall per-class results of productive landscape features (Table 4), the model exhibits relatively stable performance across different categories, with all metrics remaining at a high level. For most classes, precision, accuracy, recall, and F1-score exceed 0.85, indicating good adaptability and generalisation in terms of feature extraction and semantic discrimination. Among them, categories such as roasted whole lamb (F1 = 0.9669), stone mill (F1 = 0.9565), dried curds (F1 = 0.9469), and fishing techniques (F1 = 0.9375) perform particularly well, with average effectiveness values all above 0.94. Images in these categories typically have clear structural characteristics, relatively stable background environments, and pronounced visual differences, which facilitate the model’s capture of key features for accurate recognition [39]. Several food-processing and tool-related elements, such as butter (F1 = 0.9370), mutton shaomai (F1 = 0.9365), and utensils for fermented milk liquor (F1 = 0.9265), also achieve high classification accuracy, suggesting that the model possesses strong capability for fine-grained discrimination of details within productive landscapes. Overall, the model attains a desirable balance of performance across most categories and is able to robustly recognise images of different types of rural productive activities.
By contrast, a few categories show relatively lower classification performance, such as molds for fermented dairy products (F1 = 0.7518), the Alxa desert landscape (F1 = 0.8469), and Hulun Lake in the ice season (F1 = 0.8369). For these samples, background complexity, lighting conditions, or subtle feature details introduce additional noise, leading to reduced accuracy during feature extraction. Nevertheless, the average effectiveness across all categories remains above 0.85, and more than two-thirds of the classes achieve F1-scores higher than 0.90, indicating strong robustness and reliability in recognising multiple types of productive landscape elements. These results suggest that the model not only adapts well to the diversity of rural landscape images, but also holds promise for application to larger-scale landscape feature analysis and cultural heritage identification.
The excellent performance of CEM-ResNet50 in traditional food-related productive landscape classification is closely linked to its enhanced interpretability, which derives from the systematic reinforcement of culturally salient features. From the perspective of feature enhancement, the Cross-channel Cultural Attention (CCA) module establishes a dynamic weighting mechanism informed by prior knowledge of food culture. For “white food” categories, CCA strengthens the weights of the G/B channels, enabling the model to respond specifically to the milky sheen of butter and the matte surface of fermented dairy products. When a butter image is input, for example, the model preferentially captures specular highlights produced by surface fat and the smooth boundaries at the edges—precisely the visual cues that humans rely on when identifying “butter”. For “red food” images, CCA selectively enhances the R channel, thereby accentuating the reddish-brown charred patches on roasted whole lamb and the dark-red fibre textures of dried meat. This channel-level weighting allows the model to move beyond the undifferentiated feature extraction typical of standard networks and to focus instead on visual elements that are strongly associated with specific cultural categories.
At the level of feature representation, the Multi-scale Cultural Feature (MSCF) fusion module captures cultural characteristics in a hierarchical manner through differentiated convolution strategies. Small-scale kernels (e.g., 1 × 1 and 3 × 3) are well suited to extracting fine-grained inter-class differences, such as distinguishing the tiny lipid highlights on butter from the granular cut textures on fermented dairy products. Larger kernels (e.g., 5 × 5 and 7 × 7) are responsible for encoding global shape features, including the grid structure of roasting racks and the overall body outline of roasted whole lamb [40]. After these multi-scale features are fused via a pyramid-like structure, the model can simultaneously account for fine details and global configuration when processing complex scenes. For instance, in the classification of roasted whole lamb, the model not only analyses local processing-related cues such as the density and distribution of charred patches on the skin, but also integrates global visual cues such as the curvature and posture of the carcass.
The Cultural Semantic Mapping Layer (CSML) further endows the model with semantic reasoning capabilities. Even in the absence of explicit semantic annotations, CSML guides the model—through data-driven learning—to establish implicit associations between visual features and cultural concepts within a semantic space. Taking roasted whole lamb as an example, the feature vectors processed by CSML become strongly associated with cultural labels such as “roasting techniques” and “Mongolian cuisine”. LIME-based analyses of single-sample decision paths show that the model’s predictions for the “roasted whole lamb” category rely heavily on visual cues such as the distribution of charred patches on the skin and the overall body shape of the lamb, which correspond closely to the cultural semantics of “nomadic roasting rituals”. This mapping process from visual signals to cultural concepts means that the model’s decisions are no longer confined to pixel-level pattern matching, but instead exhibit a form of associative reasoning that aligns with human interpretation of traditional food culture.

5.3. Feature Visualisation and Semantic Structure Analysis

To examine the relationship between the high-level features learned by the models and the semantic structure of rural productive landscapes, this study applies t-SNE to the features from the penultimate layer (i.e., the output of the global average pooling or fully connected layer before the classification head) [41,42]. Figure 8 presents the feature distributions of ResNet50 and CEM-ResNet50 on the test samples, with different colours representing different semantic categories.
As shown in Figure 8, both models are able to form several relatively compact clusters in the feature space, indicating that deep convolutional networks can extract high-level features with clear discriminative power. However, for some semantically similar categories, the clusters produced by ResNet50 show substantial overlap, whereas those of CEM-ResNet50 exhibit comparatively clearer boundaries. This is particularly evident in certain “Production Techniques” and “Production Outputs” categories, where intra-class compactness is enhanced and inter-class separability is improved [43].
These observations suggest that the introduction of the cultural semantic embedding module strengthens the model’s ability to distinguish between different semantic categories in high-level feature space, and that the resulting feature distribution is more consistent with the semantic classification framework developed in Section 3. It should be noted, however, that t-SNE visualisation is essentially a qualitative analysis tool and is sensitive to sample selection and parameter settings. Its results cannot be directly interpreted as evidence of strict statistical significance. Accordingly, in this study, t-SNE analysis is used only as a complement to quantitative metrics, to aid in understanding the internal representation structure of the models, rather than as a basis for strong claims about feature distribution.

6. Conclusions

Building on the proposed semantic classification framework, dataset construction, and CNN-based experiments, the main conclusions of this study can be summarised as follows:
(1) The semantic classification framework is verifiable at the image level: The 24-class semantic framework constructed by crossing the four dimensions “Production Environment–Production Tools–Production Techniques–Production Outputs” with the four production sectors “Agriculture–Animal Husbandry–Fisheries–By-product Processing” can be stably recognised at the image level by convolutional neural networks. Under a unified training setup, all four models achieve reasonably reliable classification performance, indicating that the framework is feasible for distinguishing different types of rural productive landscapes and can serve as a foundational labelling system for subsequent studies.
(2) CEM-ResNet50 outperforms baseline models in fine-grained recognition: Compared with VGG16, ResNet18, and ResNet50, the CEM-ResNet50 model achieves consistent improvements in Top-1 accuracy as well as macro-averaged precision, recall, and F1-score, and performs particularly well on small-sample or easily confused categories. These results suggest that introducing the cultural semantic embedding module into the ResNet50 backbone—via cross-channel attention, multi-scale feature fusion, and semantic mapping—enhances the model’s ability to capture fine-grained differences in rural productive landscapes.
(3) A certain degree of consistency exists between deep features and the semantic structure: t-SNE visualisation based on features from the penultimate layer shows that different categories form discernible clusters in feature space, with CEM-ResNet50 exhibiting clearer aggregation and separation for some classes than ResNet50. This indirectly supports the view that there is a non-trivial correspondence between the semantic classification framework proposed in Section 3 and the internal representations learned by the model, thereby opening up possibilities for using model outputs in productive landscape interpretation, statistical classification, and spatial analysis.
(4) The method shows application potential for digital documentation and landscape design: The combination of the semantic classification framework and the CEM-ResNet50 model enables automatic recognition and statistical analysis of large-scale rural productive landscape images at the regional level, providing data support for productive landscape surveys, environmental design, and landscape planning. The approach also has a degree of transferability and can be extended to image recognition tasks involving other culturally structured landscapes and forms of intangible cultural heritage.

Author Contributions

Conceptualization, X.T.; methodology, X.T.; software, N.L.; validation, N.L. and S.G.; formal analysis, N.L.; investigation, X.T.; resources, X.T. and C.L.; data curation, N.A. and C.L.; writing—review and editing, S.G.; visualization, N.L.; supervision, N.A.; project administration, N.L.; funding acquisition, N.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by several grants related to the study of productive landscapes and rural cultural heritage in Inner Mongolia. It was funded by the General Project of Humanities and Social Sciences of the Ministry of Education of China, titled “Investigation and Research on Productive Landscape Heritage of Beautiful Villages in Inner Mongolia” (Grant No. 24YJA760062). Additional support was provided by the Natural Science Foundation of Inner Mongolia Autonomous Region, through the project “Research on the Digital Protection and Interactive Mode of Rural Productive Landscape Heritage Information in Inner Mongolia” (Grant No. 2024LHMS05030). This work also benefited from the Key R&D and Achievement Transformation Program of Inner Mongolia Autonomous Region, under the project “Seeing the Beauty of Productive Landscapes: Development of AR-based Virtual Rural Tourism Products” (Grant No. 2022YFDZ0017). Finally, it was supported by the project “Construction System and Key Technologies of Grassland Human Settlements” (Project Approval No. YLXKZX-NGD-004).

Data Availability Statement

The datasets and models that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bortolotto, C. From Objects to Processes: UNESCO’S ‘Intangible Cultural Heritage’. J. Mus. Ethnogr. 2007, 1, 21–33. [Google Scholar]
  2. Khalaf, R.W. A viewpoint on the reconstruction of destroyed UNESCO Cultural World Heritage Sites. Int. J. Herit. Stud. 2017, 23, 261–274. [Google Scholar] [CrossRef]
  3. Santoro, A.; Venturi, M.; Bertani, R.; Agnoletti, M. A review of the role of forests and agroforestry systems in the FAO Globally Important Agricultural Heritage Systems (GIAHS) programme. Forests 2020, 11, 860. [Google Scholar] [CrossRef]
  4. Plieninger, T.; Bieling, C. (Eds.) Resilience and the Cultural Landscape: Understanding and Managing Change in Human-Shaped Environments; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
  5. Aplin, G. World heritage cultural landscapes. Int. J. Herit. Stud. 2007, 13, 427–446. [Google Scholar] [CrossRef]
  6. Patlitzianas, K.D.; Doukas, H.; Kagiannas, A.G.; Psarras, J. Sustainable energy policy indicators: Review and recommendations. Renew. Energy 2008, 33, 966–973. [Google Scholar] [CrossRef]
  7. Hearn, K.P.; Fagerholm, N. The characterisation and future sustainability of a rural landscape: Using integrated approaches for temporal heritage landscape analysis in Northwest Spain. Landsc. Ecol. 2025, 40, 76. [Google Scholar] [CrossRef]
  8. Santoro, A.; Venturi, M.; Agnoletti, M. Agricultural heritage systems and landscape perception among tourists. The case of Lamole, Chianti (Italy). Sustainability 2020, 12, 3509. [Google Scholar] [CrossRef]
  9. Vali, A.; Comai, S.; Matteucci, M. Deep learning for land use and land cover classification based on hyperspectral and multispectral earth observation data: A review. Remote Sens. 2020, 12, 2495. [Google Scholar] [CrossRef]
  10. Zheng, Y.; Dian, Y.; Guo, Z.; Yao, C.; Wu, X. A functional zoning method in rural landscape based on high-resolution satellite imagery. Remote Sens. 2023, 15, 4920. [Google Scholar] [CrossRef]
  11. Torquati, B.; Vizzari, M.; Sportolaro, C. Participatory GIS for integrating local and expert knowledge in landscape planning. In Agricultural and Environmental Informatics, Governance and Management: Emerging Research Applications; IGI Global Scientific Publishing: Palmdale, PA, USA, 2011; pp. 378–396. [Google Scholar]
  12. Martín, B.; Ortega, E.; Otero, I.; Arce, R.M. Landscape character assessment with GIS using map-based indicators and photographs in the relationship between landscape and roads. J. Environ. Manag. 2016, 180, 324–334. [Google Scholar] [CrossRef]
  13. Srivastava, S.; Vargas Munoz, J.E.; Lobry, S.; Tuia, D. Fine-grained landuse characterization using ground-based pictures: A deep learning solution based on globally available data. Int. J. Geogr. Inf. Sci. 2020, 34, 1117–1136. [Google Scholar] [CrossRef]
  14. Clark, A.; Phinn, S.; Scarth, P. Pre-Processing training data improves accuracy and generalisability of convolutional neural network based landscape semantic segmentation. Land 2023, 12, 1268. [Google Scholar] [CrossRef]
  15. Wu, M.; Huang, Q.; Gao, S.; Zhang, Z. Mixed land use measurement and mapping with street view images and spatial context-aware prompts via zero-shot multimodal learning. Int. J. Appl. Earth Obs. Geoinf. 2023, 125, 103591. [Google Scholar] [CrossRef]
  16. Zhu, Y.; Deng, X.; Newsam, S. Fine-grained land use classification at the city scale using ground-level images. IEEE Trans. Multimed. 2019, 21, 1825–1838. [Google Scholar] [CrossRef]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  18. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  19. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  20. Feighey, W. Negative image? Developing the visual in tourism research. Curr. Issues Tour. 2003, 6, 76–85. [Google Scholar] [CrossRef]
  21. Sasithradevi, A.; Chanthini, B.; Subbulakshmi, T.; Prakash, P. MonuNet: A high performance deep learning network for Kolkata heritage image classification. Herit. Sci. 2024, 12, 242. [Google Scholar] [CrossRef]
  22. Cheng, Y.; Chen, W. Cultural Perception of Tourism Heritage Landscapes via Multi-Label Deep Learning: A Study of Jingdezhen, the Porcelain Capital. Land 2025, 14, 559. [Google Scholar] [CrossRef]
  23. Khalid, H.; Collier, M.J. Leveraging machine learning techniques for image classification and revealing social media insights into human engagement with urban wild spaces. Sci. Rep. 2025, 15, 24876. [Google Scholar] [CrossRef]
  24. Limei, N.; Dongfan, W.; Bo, Z. Landscape image recognition and analysis based on deep learning algorithm. J. Intell. Fuzzy Syst. 2025, 49, 471–481. [Google Scholar] [CrossRef]
  25. Lemenkova, P. Gathering predictors of biodiversity change and reconstructing land cover history in Central Apennines using machine learning and remote sensing data. J. Anatol. Geogr. 2025, 2, 36–47. [Google Scholar] [CrossRef]
  26. Martinez, R.M.; Baerenklau, K.A. Controlling for misclassified land use data: A post-classification latent multinomial logit approach. Remote Sens. Environ. 2015, 170, 203–215. [Google Scholar] [CrossRef]
  27. Kunwar, S.; Ferdush, J. Mapping of land use and land cover (LULC) using EuroSAT and transfer learning. arXiv 2023, arXiv:2401.02424. [Google Scholar] [CrossRef]
  28. Albert, A.; Kaur, J.; Gonzalez, M.C. Using convolutional networks and satellite imagery to identify patterns in urban environments at a large scale. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1357–1366. [Google Scholar]
  29. Ottoni, A.L.C.; Ottoni, L.T.C. ImageOP: The Image Dataset with Religious Buildings in the World Heritage Town of Ouro Preto for Deep Learning Classification. Heritage 2024, 7, 6499–6525. [Google Scholar] [CrossRef]
  30. Djenouri, D.; Laidi, R.; Djenouri, Y.; Balasingham, I. Machine learning for smart building applications: Review and taxonomy. ACM Comput. Surv. (CSUR) 2019, 52, 1–36. [Google Scholar] [CrossRef]
  31. Wei, J.; Yue, W.; Li, M.; Gao, J. Mapping human perception of urban landscape from street-view images: A deep-learning approach. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102886. [Google Scholar] [CrossRef]
  32. Krohmer, J. Landscape perception, classification, and use among Sahelian Fulani in Burkina Faso. In Landscape Ethnoecology: Concepts of Biotic and Physical Space; Berghahn Books: New York, NY, USA, 2010; Volume 49, p. 82. [Google Scholar]
  33. Gong, S.; Zhang, L.; Zhang, J.; Duan, Y. Rural Local Landscape Perception Evaluation: Integrating Street View Images and Machine Learning. ISPRS Int. J. Geo-Inf. 2025, 14, 251. [Google Scholar] [CrossRef]
  34. Rössler, M. World Heritage cultural landscapes: A UNESCO flagship programme 1992–2006. Landsc. Res. 2006, 31, 333–353. [Google Scholar] [CrossRef]
  35. Akagawa, N. Heritage Conservation and Japan’s Cultural Diplomacy: Heritage, National Identity and National Interest; Routledge: Abingdon, UK, 2014. [Google Scholar]
  36. Wu, M.; Zhou, J.; Peng, Y.; Wang, S.; Zhang, Y. Deep Learning for Image Classification: A Review. In Proceedings of the 2023 International Conference on Medical Imaging and Computer-Aided Diagnosis (MICAD 2023), Cambridge, UK, 9–10 December 2023; Su, R., Zhang, Y.D., Frangi, A.F., Eds.; Lecture Notes in Electrical Engineering; Springer: Singapore, 2024; Volume 1166. [Google Scholar] [CrossRef]
  37. Fan, Q.; Bi, Y.; Xue, B.; Zhang, M. Genetic Programming for Image Classification: A New Program Representation with Flexible Feature Reuse. IEEE Trans. Evol. Comput. 2023, 27, 460–474. [Google Scholar] [CrossRef]
  38. Masuda, T. Culture and attention: Recent empirical findings and new directions in cultural psychology. Soc. Personal. Psychol. Compass 2017, 11, e12363. [Google Scholar] [CrossRef]
  39. Chen, L.; Li, S.; Bai, Q.; Yang, J.; Jiang, S.; Miao, Y. Review of Image Classification Algorithms Based on Convolutional Neural Networks. Remote Sens. 2021, 13, 4712. [Google Scholar] [CrossRef]
  40. Zhang, C.; Bai, H.; Zhao, Y. Fine-Grained Image Classification by Class and Image-Specific Decomposition with Multiple Views. IEEE Trans. Multimed. 2023, 25, 6756–6766. [Google Scholar] [CrossRef]
  41. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 558–567. [Google Scholar] [CrossRef]
  42. Van Der Maaten, L. Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 2014, 15, 3221–3245. [Google Scholar]
  43. Guo, L.; Gu, X.; Yu, Y.; Duan, A.; Gao, H. An Analysis Method for Interpretability of Convolutional Neural Network in Bearing Fault Diagnosis. IEEE Trans. Instrum. Meas. 2024, 73, 3507012. [Google Scholar] [CrossRef]
Figure 1. Full text structure.
Figure 2. Method Flowchart.
Figure 3. Distribution of productive landscapes in Inner Mongolia Autonomous Region.
Figure 4. Classification framework and labels for productive landscapes.
Figure 5. Architecture of the CEM-ResNet50 model.
Figure 6. Accuracy: ResNet50 (a); VGG16 (b); ResNet18 (c); CEM-ResNet50 (d).
Figure 7. Loss: ResNet50 (a); VGG16 (b); ResNet18 (c); CEM-ResNet50 (d).
Figure 8. Clustering diagram of semantic features of productive landscape.
Table 1. Overview of the application of image classification methods in landscape and environmental research.

| Research Category | Reference | Main Research Content |
|---|---|---|
| Cultural heritage image classification | MonuNet: a high performance deep learning network for Kolkata heritage image classification [21] | Builds MonuNet, a deep network with a channel attention mechanism, for Kolkata's cultural heritage images |
| | Cultural Perception of Tourism Heritage Landscapes via Multi-Label Deep Learning: A Study of Jingdezhen, the Porcelain Capital [22] | Multi-label deep learning identifies landscape cultural attributes (crafts, folk customs, etc.), combined with spatial analysis methods |
| Urban ecological perception classification | Leveraging machine learning techniques for image classification and revealing social media insights into human engagement with urban wild spaces [23] | Classifies urban wilderness images from social media with machine learning to reveal patterns of public engagement |
| | Landscape image recognition and analysis based on deep learning algorithm [24] | Detects and identifies structural features in garden landscapes using SSD |
| | Gathering predictors of biodiversity change and reconstructing land cover history in Central Apennines using machine learning and remote sensing data [25] | Applies machine learning to remote sensing images for land cover classification in the Central Apennines |
| Remote sensing land use classification | Controlling for misclassified land use data: A post-classification latent multinomial logit approach [26] | Classifies land cover types in multi-period LANDSAT remote sensing images using a latent multinomial logit (Mnlogit) approach |
| | Mapping of Land Use and Land Cover (LULC) using EuroSAT and Transfer Learning [27] | Fine-tunes VGG16 and WRN models via transfer learning for the EuroSAT LULC classification task |
| | Using Convolutional Networks and Satellite Imagery to Identify Patterns in Urban Environments at a Large Scale [28] | Identifies usage patterns and structure in urban areas using CNNs and satellite imagery |
| Visual identification of building components | ImageOP: The Image Dataset with Religious Buildings in the World Heritage Town of Ouro Preto for Deep Learning Classification [29] | Develops the ImageOP dataset to identify religious architectural elements in a Brazilian World Heritage town |
| | Machine Learning for Smart Building Applications: Review and Taxonomy [30] | Uses deep learning to classify building themes and improve the efficiency of digital archive management |
| Landscape perception classification | Mapping human perception of urban landscape from street-view images: A deep-learning approach [31] | Maps public perception of Shanghai's urban landscape by analyzing street view imagery with deep learning models |
| | Landscape perception, classification, and use among Sahelian Fulani in Burkina Faso [32] | Examines Fulani landscape classification and cross-group terminology |
| | Rural Local Landscape Perception Evaluation: Integrating Street View Images and Machine Learning [33] | Constructs a rural landscape perception assessment model based on street view images and multi-dimensional (subjective + objective) indicators |
Table 2. Productive landscape classification model database.

| Parent Category | Subcategory | Feature | Test Set | Training Set | Combined Set |
|---|---|---|---|---|---|
| Production Techniques | Agriculture | Oat Flour Production Techniques | 102 | 36 | 138 |
| | | Processing Technology of Millet | 192 | 60 | 252 |
| | Fishing Industry | Fishing Skills | 291 | 95 | 386 |
| | | Craftsmanship of Birch Bark Boat | 246 | 81 | 327 |
| | Animal Husbandry | Mongolian Milk Wine Brewing Techniques | 225 | 77 | 302 |
| | | Tofu Making Techniques | 294 | 92 | 386 |
| Production Tools | Agriculture | Groove milling | 219 | 70 | 289 |
| | | Stone Mill | 294 | 92 | 386 |
| | Animal Husbandry | Tofu Mold | 165 | 65 | 230 |
| | | Milk Wine Utensils | 246 | 76 | 322 |
| | | Whip for Herding Sheep | 228 | 78 | 306 |
| | | Sheep Shovel | 189 | 62 | 251 |
| | Fishing Industry | Birch Bark Boat | 252 | 87 | 339 |
| Production Environment | Agriculture | Hetao Plain | 231 | 73 | 304 |
| | Animal Husbandry | Alxa Desert | 201 | 68 | 269 |
| | | Hulunbuir Grassland | 276 | 93 | 369 |
| | Fishing Industry | Ice Age Hulun Lake | 141 | 51 | 192 |
| Production Output | Agriculture | Oat Flour Buns (Wowo) | 234 | 71 | 305 |
| | | Millet Rice | 156 | 55 | 211 |
| | | Inner Mongolia Pot Tea | 258 | 83 | 341 |
| | Sideline Production | Milk Lumps | 279 | 87 | 366 |
| | | Butter | 297 | 89 | 386 |
| | Animal Husbandry | Roasted Whole Lamb | 231 | 74 | 305 |
| | | Lamb Shaomai | 285 | 97 | 382 |
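As a quick transcription check on Table 2, each Combined Set count should equal its Test Set plus Training Set counts. A minimal sketch over a few sample rows (counts copied from the table; the row selection is illustrative, not the full dataset):

```python
# Sample rows from Table 2: (test, training, combined) image counts.
rows = {
    "Oat Flour Production Techniques": (102, 36, 138),
    "Fishing Skills": (291, 95, 386),
    "Stone Mill": (294, 92, 386),
    "Lamb Shaomai": (285, 97, 382),
}

# Each combined count should be the sum of the test and training splits.
for name, (test, train, combined) in rows.items():
    assert test + train == combined, name

# Share of the combined images assigned to the test split for these rows.
total_test = sum(t for t, _, _ in rows.values())
total_combined = sum(c for _, _, c in rows.values())
ratio = total_test / total_combined
```

Note that, as printed in the source table, the Test Set column holds the larger of the two splits for every feature.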
Table 3. Validity of five model parameters.

| Model | Loss | Accuracy | Precision | Recall | Micro F1 | Macro F1 |
|---|---|---|---|---|---|---|
| ResNet50 | 0.4014 | 0.8612 | 0.8667 | 0.8537 | 0.8598 | 0.8294 |
| VGG16 | 0.4684 | 0.8479 | 0.8268 | 0.8269 | 0.8154 | 0.7887 |
| ResNet18 | 0.5439 | 0.8031 | 0.8098 | 0.7976 | 0.8133 | 0.7731 |
| CEM-ResNet50 | 0.3756 | 0.8847 | 0.8912 | 0.8785 | 0.8843 | 0.8421 |
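For readers comparing the Micro F1 and Macro F1 columns: micro-averaging pools true/false positives across all classes before computing a single F1 (so it equals accuracy in single-label classification), while macro-averaging computes F1 per class and takes the unweighted mean, which penalizes weak minority classes. A minimal illustration on hypothetical labels (not the paper's data):

```python
from collections import Counter


def f1_scores(y_true, y_pred, classes):
    """Compute micro- and macro-averaged F1 from label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative

    # Macro F1: unweighted mean of per-class F1 scores.
    per_class = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro = sum(per_class) / len(classes)

    # Micro F1: pool counts across classes before computing F1.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * TP / (2 * TP + FP + FN) if TP else 0.0
    return micro, macro


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "a"]
micro, macro = f1_scores(y_true, y_pred, ["a", "b", "c"])
# micro equals accuracy (4/6); macro is dragged down by class "c",
# which is never predicted correctly.
```

This gap between the two averages is consistent with Table 3, where every model's Macro F1 is a few points below its Micro F1.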
Table 4. Performance of CEM-ResNet50 on productive landscape feature recognition.

| Feature Set | Precision | Accuracy | Recall | F1 Score | Average Effectiveness |
|---|---|---|---|---|---|
| Oat Flour Production Techniques | 0.892 | 0.873 | 0.851 | 0.8618 | 0.8694 |
| Processing Technology of Millet | 0.914 | 0.892 | 0.883 | 0.8875 | 0.8941 |
| Fishing Skills | 0.951 | 0.942 | 0.933 | 0.9375 | 0.9409 |
| Craftsmanship of Birch Bark Boat | 0.923 | 0.901 | 0.894 | 0.8975 | 0.9039 |
| Mongolian Milk Wine Brewing Techniques | 0.902 | 0.884 | 0.871 | 0.8774 | 0.8836 |
| Tofu Making Techniques | 0.932 | 0.913 | 0.902 | 0.9075 | 0.9136 |
| Groove milling | 0.934 | 0.921 | 0.912 | 0.9165 | 0.9209 |
| Stone Mill | 0.973 | 0.962 | 0.951 | 0.9565 | 0.9606 |
| Tofu Mold | 0.782 | 0.763 | 0.741 | 0.7518 | 0.7594 |
| Milk Wine Utensils | 0.943 | 0.931 | 0.922 | 0.9265 | 0.9306 |
| Whip for Herding Sheep | 0.931 | 0.922 | 0.913 | 0.9175 | 0.9209 |
| Sheep Shovel | 0.882 | 0.863 | 0.851 | 0.8569 | 0.8632 |
| Birch Bark Boat | 0.921 | 0.902 | 0.893 | 0.8975 | 0.9034 |
| Hetao Plain | 0.933 | 0.922 | 0.911 | 0.9165 | 0.9206 |
| Alxa Desert | 0.872 | 0.853 | 0.841 | 0.8469 | 0.8532 |
| Hulunbuir Grassland | 0.912 | 0.901 | 0.893 | 0.897 | 0.9008 |
| Ice Age Hulun Lake | 0.862 | 0.843 | 0.831 | 0.8369 | 0.8432 |
| Oat Flour Buns (Wowo) | 0.942 | 0.933 | 0.921 | 0.9269 | 0.9307 |
| Millet Rice | 0.873 | 0.852 | 0.843 | 0.8474 | 0.8539 |
| Inner Mongolia Pot Tea | 0.941 | 0.932 | 0.923 | 0.9275 | 0.9309 |
| Milk Lumps | 0.962 | 0.953 | 0.941 | 0.9469 | 0.9507 |
| Butter | 0.952 | 0.941 | 0.933 | 0.937 | 0.9408 |
| Roasted Whole Lamb | 0.982 | 0.973 | 0.961 | 0.9669 | 0.9707 |
| Lamb Shaomai | 0.953 | 0.942 | 0.931 | 0.9365 | 0.9406 |
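The Average Effectiveness column in Table 4 is consistent with the arithmetic mean of the four preceding metric columns. A quick check on the first row, Oat Flour Production Techniques (values copied from the table):

```python
# First row of Table 4: precision, accuracy, recall, F1 score.
metrics = [0.892, 0.873, 0.851, 0.8618]

# Average effectiveness = arithmetic mean of the four metrics.
avg_effectiveness = sum(metrics) / len(metrics)

# Agrees with the tabulated value of 0.8694 up to rounding.
assert abs(avg_effectiveness - 0.8694) < 5e-4
```

The same relation holds for the remaining rows, e.g. Roasted Whole Lamb: (0.982 + 0.973 + 0.961 + 0.9669) / 4 ≈ 0.9707.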
Share and Cite

Tian, X.; Li, N.; Ai, N.; Gao, S.; Li, C. Intelligent Identification of Rural Productive Landscapes in Inner Mongolia. Computers 2025, 14, 565. https://doi.org/10.3390/computers14120565
