Cultural Symbol Preferences of Visitors to Historical and Cultural Heritage Buildings: A Case Study of the Yellow Crane Tower Based on Social Media Data and Deep Learning

Li, Liyuan; Zhang, Changzhi; Wang, Yibei; Lueng, Zack

doi:10.3390/buildings16081636

by Liyuan Li ^1,2, Changzhi Zhang ^1,2,*, Yibei Wang ^1,2 and Zack Lueng ^3

^1 School of Art and Design, Wuhan University of Science and Technology, Wuhan 430081, China
^2 New Rural Construction Research Center, Wuhan University of Science and Technology, Wuhan 430081, China
^3 Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China
^* Author to whom correspondence should be addressed.

Buildings 2026, 16(8), 1636; https://doi.org/10.3390/buildings16081636

Submission received: 31 March 2026 / Revised: 15 April 2026 / Accepted: 15 April 2026 / Published: 21 April 2026

(This article belongs to the Section Architectural Design, Urban Science, and Real Estate)

Abstract

Against the backdrop of expanding digital dissemination and experiential transformation in cultural heritage, visitors’ visual attention and symbolic choices increasingly shape heritage cognition and value transmission. Taking the Yellow Crane Tower as a case study, this research constructs a cultural symbol recognition dataset based on visitor-shared social media images and develops an enhanced ResNet-50 model for multi-label analysis. By integrating attention mechanisms and regularisation strategies, the model improves its capacity to capture complex cultural imagery, achieving a macro F1 score of 72.70% and a micro F1 score of 81.05% on the test set, indicating strong generalisation performance. The results reveal a significant imbalance in visual preferences: landmark symbols centred on the main architectural structure dominate at 32.95%, whereas culturally informative elements such as signage, cultural products, and interpretive facilities each account for less than 5%. Tag co-occurrence analysis further identifies three image production patterns, namely commemorative presentation, contextual documentation, and detail-oriented cultural photography, which reflect different levels of heritage perception. Rather than directly proposing prescriptive strategies, the findings provide an empirical basis for informing future interventions aimed at shifting from landmark-focused viewing to deeper cultural perception. In this way, the study contributes to heritage display optimisation and research on visitor visual behaviour.

Keywords: deep learning; Yellow Crane Tower; cultural symbol recognition; multi-label classification; visual communication; ResNet-50

1. Introduction

Against the backdrop of a global shift in heritage conservation philosophy from ‘physical preservation’ to ‘value dissemination’ and ‘public engagement’, the modes of presentation and pathways of perception for historic cultural heritage buildings are undergoing profound transformation [1,2,3]. As vital spatial carriers of regional historical memory and cultural identity, historic buildings possess not only material form value but also participate in the ongoing production and reconstruction of socio-cultural meaning through symbolic systems [4].

With the integration of cultural tourism and advancements in digital communication technologies, public engagement with heritage extends beyond physical site visits. Social media imagery, short videos, and online visuals enable processes of ‘secondary perception’ and ‘re-dissemination’ of cultural heritage [5,6]. In this context, the visual content that visitors choose to photograph and share constitutes a form of photographic representation through which certain cultural elements are selectively foregrounded. These visual choices may reflect visitors’ perceptual preferences and, to some extent, influence how heritage is represented and communicated in digital environments, rather than directly determining its cultural value.

Existing heritage studies predominantly explore aspects such as value assessment, conservation strategies, spatial form analysis, and exhibition design, establishing a relatively systematic theoretical framework for heritage ontology and management mechanisms [7,8,9]. However, research into how visitors perceive heritage symbols and which cultural imagery they favour remains comparatively underdeveloped relative to the focus on heritage ontology. Particularly within the contemporary context of highly mediated visual communication, visitors are no longer passive recipients of heritage information. Through acts of photographing, editing, and sharing, they actively participate in the reproduction of heritage meaning. The symbolic choices revealed in visitor imagery reflect not only individual aesthetic preferences but also the combined effects of cultural cognitive structures, spatial experience pathways, and the efficacy of display systems. Therefore, from the perspective of digitally mediated visitation, analysing the cultural symbol preference structures embedded in user-generated visual content offers a specific yet valuable lens through which to examine the alignment between heritage presentation and visitor perception, particularly regarding how heritage narratives are consumed and reconfigured within social media environments.

Concurrently, the rise of social media platforms has provided an unprecedented data foundation for heritage perception research [10]. The vast volume of visual content posted by visitors on these platforms documents their visual focal points and emotional expressions during their visits. Compared to traditional questionnaire or interview data, this user-generated visual data offers greater scale, immediacy, and contextual authenticity, providing a novel research pathway for quantifying visitors’ cultural perceptions. In recent years, scholars have begun utilising social media data to explore tourism image perception, landscape preferences, and spatial hotspot distributions. However, within the field of historical and cultural heritage architecture, related research remains largely confined to textual commentary or geolocation check-ins, with insufficient exploration of the cultural symbolic information embedded within visual imagery [11,12].

With the rapid advancement of computer vision and deep learning technologies, image semantic recognition capabilities have significantly improved, providing methodological support for the automated analysis of cultural symbols. Convolutional neural networks demonstrate superior performance in extracting complex visual features, enabling the identification of architectural elements, landscape classification, and the interpretation of cultural imagery [13,14]. Particularly within multi-label recognition frameworks, models can simultaneously capture the coexistence relationships among multiple types of cultural symbols within an image, thereby approximating authentic heritage perception scenarios. However, existing research predominantly focuses on technical applications such as architectural style recognition, artefact classification, or heritage damage monitoring. Studies applying deep learning to ‘visitor cultural symbol preference analysis’ remain scarce, with an evident disconnect between technical methodologies and heritage perception theory [15,16,17].
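To make the multi-label setting concrete, the following minimal Python sketch shows how independent per-label sigmoid scores can be thresholded so that a single image may carry several symbol labels at once. It is an illustration only: the label names, the logit values, and the 0.5 threshold are assumptions, not categories or settings reported in this study.

```python
import math

# Hypothetical label set; the study's actual symbol categories are not listed here.
LABELS = ["main_tower", "signage", "cultural_product", "landscape"]


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Return every label whose independent sigmoid probability reaches the threshold."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    return [name for name, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


# Illustrative logits for one image in which the tower and the landscape both appear.
example = predict_labels([2.1, -1.3, -0.4, 0.8])
print(example)
```

Unlike a softmax classifier, which forces exactly one winning class, each label here is decided on its own, which is what allows coexisting symbols in one photograph to be recorded together.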

Consequently, this paper shifts the research perspective from ‘heritage object recognition’ to ‘visitor symbol perception and preference analysis,’ attempting to construct an interdisciplinary pathway by integrating artificial intelligence technology with cultural heritage communication studies. Through systematic collection and analysis of social media visual data, it identifies the structural patterns of cultural symbol selection in visitors’ visual records of historical and cultural heritage architecture, thereby revealing the underlying perception mechanisms and communication logic. This research not only addresses the deficiency of traditional heritage studies, which prioritise conservation over perception, but also provides data-driven support for heritage exhibition design and cultural tourism experience optimisation.

Taking the Yellow Crane Tower, a landmark of Chinese historical and cultural significance, as a case study, this research explores visitors’ cultural symbol preferences in historical heritage architecture, leveraging the site’s dual attributes of landmark recognisability and profound cultural heritage. A cultural symbol recognition dataset was constructed from images shared by visitors on social media platforms. An enhanced ResNet-50 deep learning model is then employed for multi-label cultural symbol recognition: building upon the standard residual architecture, the model integrates a multi-scale feature fusion mechanism and incorporates channel and spatial attention modules to re-weight feature importance and enhance visually salient regions. Building upon the recognition results, frequency statistics and co-occurrence analysis are combined to quantify visitor cultural symbol preference structures and summarise their image production patterns. The findings reveal a pronounced ‘landmark gazing’ tendency in visitor visual documentation, where symbols centred on the main architectural structure dominate, while auxiliary symbols bearing cultural interpretation and narrative functions receive relatively little attention. This leads to superficial expressions of heritage cultural connotations during dissemination. Consequently, the study proposes enhancement strategies across three dimensions: optimising display systems, strengthening cultural narratives, and introducing interactive media interventions. These aim to guide visitors from singular, check-in-style viewing towards multidimensional cultural perception.

Overall, this research integrates deep learning-based image recognition with the study of visitor perception in cultural heritage contexts. It establishes a data-driven approach to analysing cultural symbol preferences in social media imagery, providing an empirical basis for understanding how visual representation may influence heritage communication, while acknowledging that such representations constitute only one dimension of broader cultural value construction.
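The frequency statistics and co-occurrence analysis mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names and the toy predictions are hypothetical, and the sketch normalises by total label occurrences, whereas this section does not state whether the study normalises by label occurrences or by number of images.

```python
from collections import Counter
from itertools import combinations


def label_shares(image_labels):
    """Share of all label occurrences taken by each label (shares sum to 1.0)."""
    counts = Counter(label for labels in image_labels for label in set(labels))
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def cooccurrence_counts(image_labels):
    """Count how many images contain each unordered pair of distinct labels."""
    pairs = Counter()
    for labels in image_labels:
        # Sorting makes (a, b) and (b, a) map to the same key.
        for a, b in combinations(sorted(set(labels)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical multi-label predictions for four images (illustrative only).
predictions = [
    ["main_tower"],
    ["main_tower", "landscape"],
    ["main_tower", "landscape", "signage"],
    ["signage"],
]
shares = label_shares(predictions)
pairs = cooccurrence_counts(predictions)
print(shares)
print(pairs.most_common())
```

Label shares of this kind support statements about which symbols dominate, while the pair counts indicate which symbols tend to be photographed together.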

2. Literature Review

2.1. A Study on Cultural Symbols in Heritage Architecture

Cultural symbols constitute a system of meaning formation developed by human societies over prolonged historical evolution, serving as vital mediators linking material forms to cultural cognition [18]. Within heritage studies, symbols are regarded not only as significant carriers of heritage value but also as pivotal pathways for the public to comprehend historical and cultural significance [19]. Historical and cultural heritage buildings, as cultural entities possessing physical form, embody specific historical contexts and cultural connotations through their spatial structures, morphological characteristics, and decorative elements. They continuously participate in the production and dissemination of socio-cultural meaning via symbolic expression. Consequently, analysing historical architectural heritage from a cultural symbol perspective has become a significant direction in heritage value interpretation and presentation research. Theoretically, cultural symbol studies primarily originate from the theoretical framework of semiotics [20]. Within urban and spatial studies, Kevin Lynch’s ‘Image of the City’ theory further elucidates the landmark function of architectural symbols in public perception [21]. Highly recognisable historic buildings often become pivotal nodes in the urban image through visual prominence, subsequently forming stable symbolic representations within collective memory. Concurrently, Lefebvre’s theory of spatial production asserts that space is not merely a physical entity but the outcome of social relations and cultural meanings. Historical heritage buildings undergo this transformation from physical structures into cultural symbols through the interplay of ‘material space–representational space–lived space’ [22].

Building upon this theoretical foundation, academic discourse has explored the symbolic constitution of historical and cultural heritage buildings from multiple dimensions. Relevant research initially examines the historical and social significance of heritage symbols. Wollentz observes that war and conflict can damage cultural symbols and the collective memory embodied within architectural heritage, underscoring the importance of identifying and preserving their symbolic value in conservation practices [23]. Wang, adopting a geographical perspective, posits that regional characteristics and ethnic cultures jointly shape the symbolic language of architectural expression, with the concept of ‘geo-architecture’ further elucidating the profound influence of environmental context on heritage symbol formation [24]. Within the cultural tourism context, scholars focus on the communicative function of heritage architecture as a vehicle for cultural narratives. Qiu et al. argue that material spaces and intangible cultural elements jointly constitute the core of tourism experiences, reinforcing public recognition of local cultural symbols [25]. Similarly, Cao et al., through research on ancient theatre architecture, reveal the crystallisation of regional artistic styles and folk culture within architectural forms, emphasising the need to integrate tangible structures with intangible cultural elements in conservation [26]. Bahauddin employs phenomenological methods to interpret the mechanisms of symbolism and sacred space construction in religious architecture through the lens of ‘sense of place,’ thereby deepening the spiritual and cultural dimensions of architectural symbols [27]. With technological advancements, cultural symbol preservation increasingly integrates digital monitoring and spatial analysis methods. Moh’d Rababeh et al. proposed intelligent management through digital heritage technologies [28]. 
Moreover, computational methods have been applied to cultural symbol reproduction, such as Zhang et al.’s use of deep learning for style transfer in digital art, expanding pathways for visual symbol preservation [29]. Within the sustainability paradigm, cultural symbol research has extended to architectural renewal and environmental adaptation. Blagojević et al. advocated integrating modern sustainable technologies into traditional building conservation, noting that cultural symbols possess not only aesthetic significance but can also perpetuate their value through functional optimisation [30]. Feng et al., using traditional villages as case studies, explored the cultural symbolic attributes of architectural colour systems, demonstrating that colour, as a visual symbol, plays a crucial role in reinforcing regional identity and historical continuity [31].

Although existing research has yielded rich findings regarding symbolic composition types, connotative meanings, and spatial expressions, several limitations persist overall. Firstly, most studies adopt an expert-interpretive perspective, emphasising textual analysis of symbols’ historical origins and cultural allusions while lacking empirical analysis of public perception dimensions. Secondly, traditional symbol studies predominantly rely on qualitative analysis and case induction, with relatively insufficient quantitative identification and structural measurement methods, making it difficult to reveal differences in attention levels among various symbols during actual dissemination. Thirdly, with the rise of visual communication on social media, the presentation of historical and cultural heritage symbols has shifted from a single display system to spontaneous public production. However, related research has paid limited attention to the structural selection of symbols within visitor imagery.

Therefore, within the contemporary context of deep integration between digital communication and visual media, it is imperative to undertake a quantitative analysis of the structural presentation of cultural symbols in heritage architecture. This analysis should be grounded in visitors’ visual perceptions and integrated with large-scale image data and intelligent recognition technologies. Such an approach not only deepens the public perception dimension within cultural symbol research but also provides data-driven support for subsequent heritage display optimisation and communication strategy formulation.

2.2. Research on Visitor Perception and Cultural Symbol Preferences

As heritage conservation shifts from a focus on ‘preserving physical entities’ to ‘public engagement and value dissemination’, visitor perception has emerged as a crucial dimension in heritage studies. Historical and cultural heritage buildings are not merely static, preserved physical structures; they are cultural symbolic spaces that undergo continuous reinterpretation and reconstruction through public observation, experience, and dissemination. Visitors participate in the production of heritage meaning through their touring behaviour, visual focus, and photographic documentation. Their perceptions and preferences significantly influence the pathways through which heritage images are socially disseminated. Consequently, examining the perceptual structures and selection mechanisms of cultural symbols from the visitor’s perspective has emerged as a vital direction for deepening heritage communication research.

Theoretically, visitor perception studies primarily stem from the Tourist Gaze theory. Urry posits that tourists develop specific modes of viewing during travel, shaped by socio-cultural constructs, leading to prioritised attention towards visually prominent and symbolically significant objects [32]. Within the context of historical and cultural heritage, architectural elements that possess landmark characteristics or symbolic significance frequently become the focal point of visitors’ visual attention, being subsequently and repeatedly reinforced in photographic documentation. Furthermore, research on place attachment indicates that visitors’ emotional resonance and cultural identification influence their level of attention towards heritage symbols, rendering visual selection value-oriented [33].

Existing research on visitor perception and cultural symbol preferences reveals that cognition of cultural and natural sites is a multidimensional process shaped by psychological, social, and cultural factors. Dake (1991), drawing on risk perception orientation, highlighted how personality traits, political ideologies, and cultural biases influence individual and collective cognition, providing a foundational psychological perspective for understanding visitor interpretation of cultural symbols [34]. Kongprasert et al. examined customers’ perceptions of Thai cultural identity, employing affective design methodologies to understand how products fulfil customer expectations and requirements [35]. Building upon this, Shao et al. conducted cross-cultural comparisons and revealed significant differences in architectural heritage symbol comprehension among visitors from distinct regions (mainland China, Hong Kong/Macau/Taiwan, East/Southeast Asia, and the West), demonstrating that cultural backgrounds profoundly influence symbolic cognition and aesthetic judgements [36]. Jiang further employed machine learning to analyse museum online review data, revealing that visitor spatial preferences and cultural perceptions exhibit dynamic evolutionary characteristics, with perceptions evolving continuously through accumulated experiences [37]. Within natural and ecological tourism contexts, Aryal et al. demonstrated that visitors’ and residents’ perceptions of ecotourism value are closely linked to livelihoods and environmental conservation benefits; such symbolic and value perceptions influence public participation and conservation support [38]. Huang et al. employ social media visual sentiment analysis to demonstrate that aesthetic imagery and emotional symbols serve as crucial mediators for visitor-established place attachment [39]. Liu et al. 
explore the relationship between metaphorical design based on traditional cultural symbols, customer experience, and cultural identity [40]. Religious tourism research further indicates that landscape sequences and spatial rituality significantly shape visitor experiences and symbolic cognition, with combined photographic and questionnaire methods enhancing the explanatory power of human–landscape relationships [41]. Furthermore, complementary environmental safety dimensions indicate that factors such as lighting conditions influence overall visitor perception and revisit intentions, constituting vital components of cultural heritage site experiences [42,43]. Tao et al. employed BERT-BiLSTM-Attention for sentiment analysis of social media reviews of a Hangzhou historic district, finding positive views on architecture and heritage but dissatisfaction with commercialisation and accessibility [44]. In summary, visitor preferences for cultural symbols are shaped not only by cultural backgrounds and psychological structures but also by the interplay of spatial experiences, ecological values, visual affect, and safety perceptions. These elements collectively form the key mechanisms underpinning heritage site evaluation and behavioural decision-making.

Although existing research has revealed visitors’ visual preferences from perspectives of tourism behaviour and communication studies, several shortcomings remain. Firstly, in terms of data types, most studies still rely predominantly on textual reviews and check-in data, with insufficient exploration of the deeper semantic information within image content. Secondly, regarding research methodologies, visual data analysis often remains confined to manual classification or landscape aesthetic evaluations, lacking automated and scalable identification techniques. Thirdly, existing research predominantly focuses on tourism imagery or landscape preferences, with limited attention paid to the structure of cultural symbols within the context of historical and cultural heritage architecture. Consequently, a systematic framework for analysing symbol preferences has yet to be established. Therefore, against the backdrop of rapidly accumulating social media visual data and the continuous maturation of computer vision technology, it is necessary to introduce deep learning image recognition methods. This enables multi-label automated parsing of cultural symbols within visitor imagery, thereby quantitatively revealing their preference structures and co-occurrence relationships. This not only enhances the precision of visitor perception research but also provides data support for subsequent heritage display optimisation and cultural dissemination strategy formulation.

2.3. Research on Cultural Heritage Image Recognition Based on Deep Learning

With the advancement of digital heritage preservation and smart tourism, the acquisition of cultural heritage information is gradually shifting from traditional manual documentation to automated recognition based on computer vision [45,46]. Among these, deep learning techniques, particularly convolutional neural networks (CNNs), have become central to cultural heritage image recognition research due to their significant advantages in feature extraction and image classification. Early studies predominantly relied on handcrafted features such as SIFT and HOG for identifying heritage building facades, patterns, and decorative elements, yet their recognition accuracy proved limited under complex backgrounds, varying lighting conditions, and scale differences. The introduction of deep learning has substantially enhanced the capability for analysing cultural heritage images. Scholars have constructed heritage architectural image datasets and employed classic CNN models, such as AlexNet, VGG, ResNet, and Inception, to conduct heritage type classification and style recognition studies, achieving a methodological shift from ‘feature engineering-driven’ to ‘data-driven learning’ [47].

At the theoretical and technical framework level, Paolanti et al. note that Pattern Recognition (PR) forms a crucial foundation for cultural heritage image analysis, emphasising machines’ capabilities in environmental parsing, pattern differentiation, and decision support. Deep learning, as a key branch of PR, has significantly advanced complex image classification and retrieval technologies [48]. In specific heritage applications, Yang et al. constructed a specialised dataset for Thangka paintings and proposed machine vision-based matching rules, validating deep learning’s applicability for identifying niche cultural artefacts [49]. Similarly, Gao et al. incorporated attention mechanisms in classifying images of overseas Chinese architectural heritage, achieving a mean average precision (mAP) of 76%. This demonstrates deep learning algorithms’ capacity to effectively capture complex architectural forms and detailed features, providing technical support for cultural data preservation and retrieval [50]. Beyond architectural form recognition, deep learning has also been extended to the analysis of textual and symbolic heritage. Liu et al. proposed a semi-supervised self-training method for oracle bone script recognition. By learning structural correlations across different fonts, this approach enables efficient deciphering of ancient characters, demonstrating deep learning’s potential in historical artefact interpretation [51]. Regarding heritage presentation and dissemination, scholars such as Liu integrated deep learning into the interactive interface design for Chongqing’s intangible cultural heritage. By optimising visual perception and system usability, they enhanced public engagement experiences, highlighting the application value of intelligent technologies in cultural transmission scenarios [52]. Concurrently, public-participation heritage image-sharing initiatives provide foundational data for deep learning applications. Azizifard et al. 
examined crowdsourced platforms like ‘Wiki Loves Monuments’, highlighting the critical role of community-contributed imagery in heritage documentation while noting that deep learning algorithms can further enhance image content recognition efficiency and accuracy [53]. Addressing complex environmental challenges in heritage building identification, scholars, including Folino, employed CNNs to classify architectural site imagery. This approach effectively overcame variations in preservation status, lighting conditions, and background interference, validating deep learning’s reliability for heritage recognition in real-world scenarios [54]. From an interdisciplinary perspective, the application boundaries of deep learning in cultural heritage continue to expand. Gîrbacia’s review systematically mapped diverse application pathways for artificial intelligence in this field [55]; Tian et al. explored practical frameworks for digital technology in regional heritage conservation and development by constructing datasets of rural and productive landscapes [56]; Ju employed scientometric analysis to reveal macro-level trends and research hotspots in cultural heritage image recognition technology, noting its gradual expansion into broader domains, such as ecological monitoring and social governance, and demonstrating sustained research potential [57]. Kumar et al. used CNNs to automatically identify cultural heritage and damage in social media images, demonstrating effective filtering of post-disaster heritage photos with reduced manual effort [58]. Viñals et al. deployed person-counting cameras and proxemic thresholds in a Valencia historic street to monitor visitor density in real time, providing a smart tool for proactive crowd management [59].

Overall, research into deep learning-based image recognition for cultural heritage has evolved from early single-classification tasks focused on architectural typology and style identification [17]. It has progressively expanded into more culturally oriented recognition dimensions, such as cultural symbol extraction, decorative semantic analysis, and heritage value representation [58]. Advances in related technologies now enable researchers to automatically capture key imagery and symbolic elements of heritage buildings from vast visual datasets, providing a methodological foundation for the quantitative analysis of cultural symbols. Particularly against the backdrop of rapidly proliferating social media imagery, user-generated content (UGC) has emerged as a vital data source for observing public perceptions of heritage and symbolic preferences [60]. However, existing research predominantly focuses on identifying heritage entities and conservation monitoring, with limited exploration of the presentation structures and attention differences in cultural symbols from a ‘visitor visual production’ perspective [61]. Systematic recognition of multi-label symbolic systems also remains relatively underdeveloped. Therefore, introducing deep learning image recognition methods into the context of visitor imagery to construct a multi-label recognition framework for cultural symbols not only helps reveal the visual representation patterns of heritage architecture within contemporary communication environments but also provides new technical pathways and empirical support for understanding visitor cultural perception mechanisms and optimising heritage display and communication strategies.

3. Methodology

3.1. Study Site and Data Collection

This study selected the Yellow Crane Tower scenic area in Wuhan, Hubei Province, China, as its empirical research site (see Figure 1). Originally built in 223 AD during the Three Kingdoms period and situated on Snake Hill along the southern bank of the Yangtze River, the Yellow Crane Tower serves as a vital nexus of Yangtze River culture, Chu culture, and Wuhan’s urban heritage, embodying outstanding historical architectural value and profound cultural symbolism. As a ticketed attraction with a structured spatial layout and defined visitor circulation, it maintains a relatively stable visitor demographic. The tower’s architectural form, historical narratives, and poetic associations constitute a highly composite cultural symbol system, granting it exceptional distinctiveness and recognisability among China’s traditionally renowned towers. The site has experienced short-term visitor surges during major holidays; for instance, daily counts exceeded 50,000 during the 2025 National Day and Mid-Autumn Festival period. However, the present study relies on long-term, large-scale user-generated content (UGC) from online platforms rather than on-site observations of specific dates, and thus is not directly contingent upon average daily visitation figures. Nevertheless, extreme crowding during peak periods may constrain opportunities for in-depth appreciation and influence the types of images captured and shared, a limitation acknowledged herein. Concurrent with steady regional tourism growth and a marked rise in international visitors, initiatives such as the night tour programme, characterised by distinctive lighting and atmospheric presentation, have further shaped visitors’ visual perception and image production. At the urban scale, Wuhan’s expanding tourism market underscores the Yellow Crane Tower’s pivotal role in city branding and cultural consumption. 
Transcending its physical monument status, the tower functions as a multifaceted cultural icon whose imagery is continuously reproduced and circulated across social media and promotional platforms, generating a diverse and symbolically layered visual dataset. Consequently, the site’s high landmark recognition, dense cultural symbolism, and frequent image dissemination render it an ideal case for examining visitor preferences for cultural symbols in heritage architecture and for uncovering phenomena such as ‘landmark gaze’ and potential cultural dilution in contemporary heritage communication.

In recent years, the rapid expansion of the internet has spurred considerable academic interest in large-scale big data and social media datasets [62,63,64]. In 2024, the top three Chinese travel apps were Ctrip Travel, Qunar Travel, and Tongcheng Travel, reflecting their widespread influence and popularity [65]. Accordingly, this study utilised Python 3.12-based web scraping techniques to collect user-generated content from Ctrip Travel, one of China’s most widely adopted tourism platforms. The data collection process focused on the review section of the Yellow Crane Tower scenic area, where a customised crawler automatically retrieved user-uploaded images and their accompanying textual comments from publicly accessible pages. Due to platform-imposed restrictions, Ctrip permits access to a maximum of 300 pages of review records; consequently, the dataset was assembled by extracting entries sequentially backwards from the most recent available date (21 October 2025) to the earliest accessible page within this limit. As the platform does not explicitly disclose the exact temporal endpoint of the final page, the resulting dataset represents a continuous but platform-defined temporal window rather than a precisely bounded interval. In total, 17,669 user-generated image comments were collected. These voluntarily uploaded images capture a wide range of visual perspectives of the scenic area. However, it is important to note that the platform provides no metadata regarding shooting distance, camera angle, or precise spatial coordinates, precluding any explicit delineation of the spatial scope of image capture. All images pertain to the Yellow Crane Tower scenic area, a semi-open heritage environment encompassing both outdoor landscape spaces and interior exhibition zones. 
Additionally, while each entry includes a timestamp, the platform does not supply corresponding visitor flow data (e.g., real-time crowd density), thereby making it impossible to reliably distinguish between peak and off-peak visitation periods or to perform comparative analyses contingent on visitor density. Notwithstanding these constraints, the dataset affords a large-scale and ecologically valid representation of visitor-generated visual content, and the study prioritises the identification of general patterns in cultural symbol perception and visual preference over fine-grained spatiotemporal behavioural distinctions. The data collection process was conducted in strict compliance with research ethics and relevant regulations, including the exclusive use of publicly accessible data, respect for intellectual property rights, and the restriction of data usage solely to academic research purposes.

3.2. Research Framework Description

This study established an integrated research framework comprising ‘cultural symbol classification, multi-label cultural perception model construction, result analysis, and strategy enhancement’. Firstly, building upon cultural symbol theory, a classification system encompassing architectural cultural symbols, historical cultural symbols, natural environment symbols, and cultural and creative symbols was developed to delineate the typological composition and content boundaries of the Yellow Crane Tower’s cultural symbols. Subsequently, a multi-label cultural perception model was developed using deep learning methodologies. This involved scraping image data from the Ctrip platform, manual annotation and data cleansing, and training an enhanced ResNet-50 network to recognise multiple categories of cultural symbols automatically. Following model training, the prediction confidence of the recognition results was analysed through key indicators, including average probability, standard deviation, weighted counts, minimum probability, and maximum probability, to ensure the reliability and interpretability of the model output. Finally, based on the proportion of cultural symbols identified by the model and their presentation characteristics, the current status of the Yellow Crane Tower’s cultural symbols was analysed systematically. Targeted strategies for enhancing cultural symbols were then proposed, providing methodological support and practical reference for the digital perception of historical and cultural heritage and the enhancement of symbolic value (see Figure 2 for details).

3.3. Cultural Symbol Classification System

During the data preprocessing stage, images in which human figures or faces constituted the dominant visual subject were excluded. This filtering strategy was adopted to ensure that the dataset primarily reflects architectural features and cultural symbols, thereby maintaining alignment with the research focus.

To systematically deconstruct and examine the visual representations of the Yellow Crane Tower as a cultural complex, this study constructs a multi-tiered classification system of cultural symbols based on its physical form, cultural connotations, and relevant research findings [66,67,68,69]. This framework first encompasses its core architectural structure, including intricate roof and eaves, ancillary structures, and other physical elements. Secondly, it incorporates interior decorations and furnishings reflecting the internal cultural ambience, alongside poetic inscriptions and plaques bearing historical context. Furthermore, it integrates historical–cultural symbols, natural environment symbols (such as the Yangtze River and the park’s water features), and landscape features to reflect its interaction with both natural and human environments. Finally, the framework extends to the macro-level urban skyline and meso-level parkland vistas while also encompassing contemporary cultural dissemination and experiential dimensions through cultural and creative products, distilled cultural symbols, and functional wayfinding signage. This comprehensive classification system aims to fully encompass all symbolic elements of Yellow Crane Tower, from traditional material heritage to modern cultural innovation, laying the groundwork for subsequent precise identification and in-depth analysis (see Table 1 for details).

3.4. Multi-Label Cultural Perception Model

3.4.1. ResNet-50 Model Overview

ResNet-50 (Residual Network-50) is a deep convolutional neural network architecture proposed by He et al. in 2015 and is a key representative of the classic ResNet series [70]. Its core innovation lies in the introduction of the residual learning mechanism. As shown in Figure 3, by incorporating identity shortcut connections into the network, it effectively mitigates the issues of gradient vanishing and degradation that arise as the network deepens, enabling stable training of networks comprising hundreds or even thousands of layers. The network comprises five main stages (Stages 0 to 4). Stage 0 serves as the initial processing stage and has a relatively simple structure, whilst the subsequent four stages consist of bottleneck (BTNK) residual blocks. Each block learns a residual function with respect to its input rather than a complete mapping, which allows very deep networks to be trained end to end. Specifically, Stage 1 contains three residual blocks, whilst Stages 2, 3 and 4 contain four, six and three residual blocks, respectively. In Stage 0, the input image has dimensions of (3, 224, 224), representing the number of channels (C), height (H) and width (W), respectively. This stage comprises two key operations: the first layer performs a 7 × 7 convolution (CONV) with a stride of 2, followed by batch normalisation (BN) and ReLU activation; the second layer performs a 3 × 3 max-pooling (MAXPOOL) operation with a stride of 2. Following these operations, the output feature map has the shape (64, 56, 56). In the subsequent four stages, the network progressively deepens, and the number of feature channels gradually increases. Each stage contains a specific number of residual blocks, which adjust the spatial dimensions of the feature maps through downsampling operations to enhance the network’s ability to capture complex features.
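The feature-map sizes described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the padding values (3 for the convolution, 1 for the pooling layer) and the per-stage channel widths are the standard ResNet-50 settings and are assumed here, since the text does not state them.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one axis for a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# Stage 0: 7x7 convolution (stride 2), then 3x3 max-pooling (stride 2).
# Padding values of 3 and 1 are assumed; the text does not state them.
side = conv_out(224, kernel=7, stride=2, padding=3)   # after the convolution
side = conv_out(side, kernel=3, stride=2, padding=1)  # after max-pooling
shapes = [(64, side, side)]

# Stages 1-4 as (bottleneck blocks, output channels, stride of the stage),
# using the standard ResNet-50 channel widths (an assumption).
stages = [(3, 256, 1), (4, 512, 2), (6, 1024, 2), (3, 2048, 2)]
for _blocks, channels, stride in stages:
    side //= stride  # a stride-2 stage halves height and width
    shapes.append((channels, side, side))

# Each bottleneck block has 3 convolutions; adding the stem convolution and
# the final fully connected layer recovers the "50" in ResNet-50.
weighted_layers = 1 + 3 * sum(blocks for blocks, _, _ in stages) + 1

for stage, shape in enumerate(shapes):
    print(f"Stage {stage}: {shape}")
print("Weighted layers:", weighted_layers)
```

Under these assumptions, the spatial resolution halves at Stages 2 to 4 while the channel count doubles, which is the trade-off between spatial detail and semantic depth that the multi-scale design in the next subsection builds on.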

3.4.2. Enhanced ResNet-50 Model Overview

Drawing upon the relevant literature and existing research findings, we optimised and enhanced the ResNet-50 architecture. As illustrated in Figure 4, the enhanced ResNet-50 model incorporates structural optimisation and feature fusion design tailored for multi-label image recognition tasks, building upon the traditional residual network framework [71,72,73]. This model introduces a multi-scale feature fusion mechanism. Semantic features of varying levels, output from different stages of ResNet, are reduced to a uniform channel count via 1 × 1 convolutions. This enables scale alignment and concatenation, thereby balancing shallow-layer texture and edge information with deep-layer semantic expressiveness. Building upon this foundation, the model further incorporates Channel Attention (CA) and spatial attention (SA) modules to achieve feature importance re-weighting and enhancement of salient regions, thereby improving the model’s perception of critical information and feature discrimination capabilities [74,75]. Moreover, to prevent overfitting while bolstering the model’s generalisation capability, Dropout and Dropout2d regularisation mechanisms are incorporated within the feature fusion layer. Overall, the enhanced ResNet-50 model demonstrates superior robustness and accuracy in multi-label visual feature extraction and classification performance compared to the traditional ResNet architecture, achieved through the synergistic optimisation of multi-scale feature fusion, attention mechanisms, and regularisation strategies.

3.5. Model Training Steps

3.5.1. Dataset Construction

From 17,669 Ctrip travel images, 5000 high-quality photographs were selected through image quality screening and thematic clarity assessment. This formed a dataset encompassing elements related to the Yellow Crane Tower, ensuring both authenticity and diversity of sources. The dataset annotation was jointly completed by the author and relevant specialist researchers, each possessing pertinent knowledge backgrounds and holding at least a bachelor’s degree. The annotation process utilised the LabelImg tool to mark cultural elements with corresponding labels, saving them in YOLO format (details shown in Figure 5).

3.5.2. Image Processing and Data Augmentation

During the training phase, multiple data augmentation strategies were employed to enhance the model’s generalisation capability: Firstly, images were uniformly resized to 224 × 224 resolution via RandomResizedCrop, with a wide scaling range of 0.5–1.0 applied to improve scale invariance; subsequently, random horizontal flipping (probability 0.5) and ±20-degree random rotation were implemented to enhance spatial invariance. For colour enhancement, joint jittering of brightness, contrast, and saturation with an intensity of 0.5 was applied, supplemented by random greyscaling with a probability of 0.2. Additionally, Gaussian blurring (kernel size 3, probability 0.2) and random erasure (probability 0.2, erasure area 2–15%) were introduced to simulate image degradation in real-world scenarios. Following all enhancement operations, standard ImageNet normalisation was applied (mean [0.485, 0.456, 0.406], standard deviation [0.229, 0.224, 0.225]).

During the validation and testing phases, a standard preprocessing workflow was applied: images were resized proportionally to 256 pixels and then centre-cropped to 224 × 224. The same normalisation as in training was applied, without any augmentation, ensuring fairness and reproducibility in evaluation.

This differentiated processing strategy effectively broadened the coverage of the data distribution through multidimensional augmentation during the training phase while ensuring the accuracy of model evaluation through standardised processing during the testing phase. The overall design aligns with best practices in deep learning.
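As a rough illustration of the evaluation-time geometry, the sketch below computes the resized dimensions and the centre-crop box for one image. It assumes that ‘256 pixels’ refers to the shorter side, which the text does not state explicitly; the function names and the 4000 × 3000 example image are hypothetical.

```python
def resize_shorter_side(width: int, height: int, target: int = 256):
    """Scale an image so that its shorter side equals `target` (assumed reading)."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width: int, height: int, crop: int = 224):
    """Return (left, top, right, bottom) of a centred crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# Hypothetical 4000 x 3000 landscape photograph.
resized = resize_shorter_side(4000, 3000)
box = center_crop_box(*resized)
print(resized, box)
```

Because the crop is deterministic, every evaluation run sees exactly the same pixels for a given image, which is what makes validation and test scores comparable across epochs.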

3.5.3. Model Training

The core objective of model training is to continuously refine parameters through optimisation algorithms, thereby enhancing predictive accuracy for unseen data. Training data is partitioned into training, validation, and test sets: the training set updates model parameters, the validation set assesses model performance, and the test set ultimately evaluates learning outcomes. All images across these three datasets undergo expert annotation and review. The model generates predictions based on input data. When discrepancies arise between predicted and actual values, optimisation algorithms such as stochastic gradient descent (SGD) correct errors, driving parameters towards convergence to an optimal solution. This study employs a multi-label classification objective function based on binary cross-entropy loss [76,77], mathematically expressed as:

L_{BCE} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{c = 1}^{C} \left[ y_{i, c} \log (σ (z_{i, c})) + (1 - y_{i, c}) \log (1 - σ (z_{i, c})) \right]

(1)

where L_{BCE} denotes the binary cross-entropy loss function; N represents the total number of training samples, used to average the loss across all samples; C denotes the total number of categories (C = 2 in binary classification, but in multi-label classification, C represents the number of labels, with each label treated as an independent binary classification); y_{i, c} denotes the true label of the c-th class for the i-th sample, taking values 0 or 1 (1 indicates belonging to that class, 0 indicates not belonging); z_{i, c} is the raw model output (logit) for that sample and class; σ (z_{i, c}) denotes the predicted probability of the i-th sample belonging to class c, computed via the sigmoid function with an output range of (0, 1), representing the model’s assessment of the sample’s likelihood of belonging to class c; and the terms y_{i, c} \log (σ (z_{i, c})) and (1 - y_{i, c}) \log (1 - σ (z_{i, c})) are the log-probabilities assigned to the positive and negative outcomes, respectively, serving to quantify the ‘degree of alignment between the predicted probability and the true label’.
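Equation (1) can be written out directly as a short function. This is a minimal pure-Python sketch for illustration only; the small constant `eps` is an assumption added for numerical safety and is not part of Equation (1).

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw model output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(logits, targets, eps: float = 1e-12) -> float:
    """Equation (1): sum over the C labels, then average over the N samples.

    `logits` and `targets` are lists of N rows with C entries each.
    `eps` guards against log(0) and is an illustrative assumption.
    """
    n_samples = len(logits)
    total = 0.0
    for z_row, y_row in zip(logits, targets):
        for z, y in zip(z_row, y_row):
            p = sigmoid(z)
            total += y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return -total / n_samples


# Two samples, three labels; a logit of 0 corresponds to a probability of 0.5.
loss = bce_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [[1, 0, 1], [0, 1, 0]])
print(loss)
```

Note that Equation (1) divides by N only, so the value scales with the number of labels; library implementations that average over all N × C entries would report this value divided by C.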

The training process employs the AdamW optimisation algorithm (Adam with decoupled weight decay) for parameter updates. This algorithm incorporates decoupled weight decay regularisation [78] into the standard Adam framework, with its mathematical expression being:

m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) g_{t}

(2)

v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) g_{t}^{2}

(3)

{\hat{m}}_{t} = m_{t} / (1 - β_{1}^{t}), {\hat{v}}_{t} = v_{t} / (1 - β_{2}^{t})

(4)

θ_{t} = θ_{t - 1} - η (\frac{{\hat{m}}_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} + λ θ_{t - 1})

(5)

Adam (Adaptive Moment Estimation) combines the advantages of momentum and RMSProp, achieving efficient optimisation through adaptive learning rates. Equation (2) gives the first-order moment estimate (momentum): g_{t} represents the gradient at step t (the partial derivative of the loss function with respect to the model parameters); β_{1} denotes the momentum decay coefficient (typically set to 0.9), used to smooth gradients and reduce gradient fluctuations; and m_{t} denotes the first-order moment estimate of the gradient (an exponential moving average of the gradient), reflecting the gradient’s ‘trend’. Equation (3) gives the second-moment estimate (adaptive learning rate): β_{2} represents the variance decay coefficient (typically set to 0.999), smoothing the squared gradient; and v_{t} denotes the second-moment estimate of the gradient (an exponential moving average of the squared gradient), reflecting gradient ‘volatility’. Equation (4) applies bias correction. AdamW (Adam with decoupled weight decay) refines Adam’s weight decay implementation, with the core distinction lying in the parameter update step in Equation (5): θ_{t} denotes the model parameters (e.g., weights, biases) at step t; η denotes the learning rate (controlling the step size of parameter updates); ϵ is a small constant that prevents division by zero; λ θ_{t - 1} denotes the weight decay term (AdamW’s core enhancement), which shrinks the parameters directly in the update step rather than adding an L2 penalty to the loss, thereby suppressing overfitting; and \frac{{\hat{m}}_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} combines momentum and adaptive learning rate gradient updates (Adam’s core logic), enabling parameter update steps to adaptively adjust based on their gradient’s magnitude and stability.
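To make Equations (2) to (5) concrete, the sketch below performs a single AdamW update for one scalar parameter. The learning rate and ϵ are illustrative assumptions (the text does not report them); β_{1} and β_{2} use the typical values quoted above, and the weight decay of 1 × 10⁻³ is the value stated later in this section.

```python
import math


def adamw_step(theta, grad, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=1e-3):
    """One AdamW update for a scalar parameter, following Equations (2)-(5).

    `lr` and `eps` are assumed values chosen only for illustration.
    """
    m = beta1 * m + (1 - beta1) * grad            # Eq. (2): first moment
    v = beta2 * v + (1 - beta2) * grad ** 2       # Eq. (3): second moment
    m_hat = m / (1 - beta1 ** t)                  # Eq. (4): bias correction
    v_hat = v / (1 - beta2 ** t)
    # Eq. (5): adaptive gradient step plus decoupled weight decay.
    theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v


theta, m, v = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(theta, m, v)
```

At the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning-rate unit regardless of the gradient's scale; the decay term adds a small pull towards zero that is independent of the gradient.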

3.5.4. Model Evaluation Metrics

The evaluation criteria for models vary depending on the prediction task, such as binary classification, multi-class classification, and multi-label classification, each requiring distinct assessment metrics. Multi-label prediction models typically employ accuracy, macro F1 score, and micro F1 score to evaluate model precision [79,80]. The specific calculation formulas are as follows:

Accuracy: Accuracy denotes the proportion of correctly predicted samples among all predictions. It measures the overall correctness of the model’s predictions. Its formula is as follows:

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} .

(6)

where TP (True Positive) denotes genuine positive instances, representing the number of samples correctly predicted as positive; TN (True Negative) denotes genuine negative instances, representing the number of samples correctly predicted as negative; FP (False Positive) denotes false positives, representing the number of samples erroneously predicted as positive; and FN (False Negative) denotes false negatives, representing the number of samples erroneously predicted as negative.

Precision: Precision denotes the proportion of samples correctly classified as positive among all samples predicted as positive by the model. It measures the accuracy of the model’s predictions. Its formula is as follows:

\mathrm{Precision} = \frac{TP}{TP + FP} .

(7)

Recall: Recall denotes the proportion of all actual positive class samples correctly predicted as positive by the model. It measures the model’s ability to identify all positive class instances. Its formula is as follows:

\mathrm{Recall} = \frac{TP}{TP + FN} .

(8)

F1 score: The F1 score is the harmonic mean of precision and recall. When both precision and recall are very high, the F1 score approaches 1, representing an ideal scenario. Conversely, if either value is low, the F1 score is also affected and tends towards a lower value. Its calculation formula is as follows:

F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(9)

Micro F1 Score: Micro F1 is a metric calculated through global statistical computation based on prediction results across all categories. It treats each prediction instance equally rather than each category. Micro F1 does not require categorisation; it directly calculates the F1 score using the overall sample’s precision and recall. It first computes the total precision and recall across all categories, then applies the F1 formula to derive the Micro-F1 value. The calculation formula is as follows:

\mathrm{Precision}_{micro} = \frac{\sum_{i = 1}^{n} {TP}_{i}}{\sum_{i = 1}^{n} {TP}_{i} + \sum_{i = 1}^{n} {FP}_{i}} .

(10)

\mathrm{Recall}_{micro} = \frac{\sum_{i = 1}^{n} {TP}_{i}}{\sum_{i = 1}^{n} {TP}_{i} + \sum_{i = 1}^{n} {FN}_{i}} .

(11)

F1_{micro} = 2 \cdot \frac{\mathrm{Precision}_{micro} \cdot \mathrm{Recall}_{micro}}{\mathrm{Precision}_{micro} + \mathrm{Recall}_{micro}} .

(12)

Macro F1 Score: The Macro F1 score is calculated by first computing the F1 score for each category independently and then taking the average. This approach treats each category equally without considering the size of its sample. Unlike micro F1, macro F1 requires first calculating the precision and recall for each category, along with their respective F1 scores, before determining the overall F1 score across the entire sample by taking the mean. The formula is as follows:

\mathrm{Precision}_{macro} = \frac{\sum_{i = 1}^{n} \mathrm{Precision}_{i}}{n} .

(13)

\mathrm{Recall}_{macro} = \frac{\sum_{i = 1}^{n} \mathrm{Recall}_{i}}{n} .

(14)

F1_{macro} = 2 \cdot \frac{\mathrm{Precision}_{macro} \cdot \mathrm{Recall}_{macro}}{\mathrm{Precision}_{macro} + \mathrm{Recall}_{macro}} .

(15)
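The difference between the two averaging schemes is easiest to see on a small example. The sketch below is illustrative only, with hypothetical labels and predictions. It computes macro F1 as the mean of per-category F1 scores, following the prose definition above; this generally differs from the harmonic mean of macro-averaged precision and recall written in Equation (15), and which of the two was used in the experiments is not stated in the text.

```python
def per_label_counts(y_true, y_pred):
    """Count TP, FP and FN separately for each label column."""
    n_labels = len(y_true[0])
    counts = [{"tp": 0, "fp": 0, "fn": 0} for _ in range(n_labels)]
    for t_row, p_row in zip(y_true, y_pred):
        for c, (t, p) in enumerate(zip(t_row, p_row)):
            if t == 1 and p == 1:
                counts[c]["tp"] += 1
            elif t == 0 and p == 1:
                counts[c]["fp"] += 1
            elif t == 1 and p == 0:
                counts[c]["fn"] += 1
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts) -> float:
    """Pool the counts over all labels first (Equations (10)-(12))."""
    return f1(sum(c["tp"] for c in counts),
              sum(c["fp"] for c in counts),
              sum(c["fn"] for c in counts))


def macro_f1(counts) -> float:
    """Average the per-label F1 scores, weighting every label equally."""
    return sum(f1(**c) for c in counts) / len(counts)


# Hypothetical ground truth and predictions: 4 images, 3 labels.
y_true = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 0, 1]]
counts = per_label_counts(y_true, y_pred)
print(micro_f1(counts), macro_f1(counts))
```

In this toy case the rare third label is never predicted correctly, which pulls macro F1 well below micro F1. The same mechanism explains why a macro score can trail a micro score when some categories are sparsely represented.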

The dataset was randomly partitioned into training, validation, and test sets at ratios of 70%, 15%, and 15%, respectively, comprising 3500 training images, 750 validation images, and 750 test images. Following the aforementioned parameter optimisation, the model’s F1 score results are illustrated in Figure 6, with the loss function depicted in Figure 7.
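A random 70/15/15 partition of this kind can be sketched as follows. The function name and the fixed seed are assumptions introduced for illustration; the text does not state how the split was seeded or whether it was stratified by label.

```python
import random


def split_indices(n_items: int, train_pct: int = 70, val_pct: int = 15, seed: int = 0):
    """Shuffle item indices and cut them into train, validation and test parts.

    The seed is an illustrative assumption; it only makes the split repeatable.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


train_idx, val_idx, test_idx = split_indices(5000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the three subsets disjoint and identical across runs, so that baseline and regularised models are compared on the same images.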

As demonstrated in Figure 6 and Figure 7, the results indicate that the enhanced multi-label classification model employing dropout regularisation (p = 0.5), reinforced L2 weight decay (1 × 10⁻³), and an early stopping mechanism significantly improves generalisation capability and training efficiency compared to the baseline architecture. The validation set macro F1 score reached 72.70%, and the micro F1 score reached 81.05%, while markedly reducing overfitting characteristics. Architectural regularisation achieved by inserting a dropout layer before the final classification layer, combined with spatial dropout (p = 0.25) within the feature fusion module, effectively constrained the model’s representational capacity and prevented co-adaptation between feature detectors. This property is precisely the root cause of the baseline model’s tendency towards training set memorisation. The gap between training and validation set performance plummeted from 17.81 percentage points in the unregularised baseline to 4.02 percentage points in the regularised model, reducing overfitting by 77.4%. Concurrently, the validation set macro F1 score improved by 2.25 percentage points and the micro F1 score by 2.19 percentage points. This demonstrates that appropriately calibrated regularisation enhances rather than diminishes generalisation performance. The validation loss curve exhibits markedly improved stability compared to the baseline model, maintaining relative stability between 0.12 and 0.13 from epoch 6 to 13. This contrasts with the unregularised architecture’s pronounced monotonic increase from 0.12 to 0.21, indicating that the regularisation strategy successfully curbed the progressive overfitting observed in the baseline model during later training stages. 
The early stopping mechanism terminated training at epoch 14 upon detecting that the validation set macro F1 score had failed to surpass the optimal performance achieved between epochs 4 and 6 (where the validation set micro F1 peaked at approximately 81.5%) for ten consecutive epochs. This approach prevented further degradation of generalisation capabilities due to prolonged training while reducing computational expenditure by 72% (training duration: 5.6 h versus 24 h for the baseline model), achieving significant resource efficiency gains without compromising model quality. The regularised model’s training F1 scores were 81.08% macro and 85.07% micro; though lower than the baseline model’s 96.17% and 96.71%, they represent a more realistic and sustainable performance level. This indicates the model achieved genuine pattern learning rather than rote memorisation. The small gap of 4.02 percentage points between training and validation sets supports its efficacy. This figure falls within the range generally regarded as acceptable for deep learning applications (typically 2–10% for regularised models) [81,82,83] and aligns with published benchmarks in the computer vision literature featuring comparable architectures and dataset scales. The validation set F1 score trajectory exhibited exceptional stability throughout training: the micro F1 value remained consistently around 81% from epoch 4 to epoch 13, without the significant decline observed in the baseline model, where performance fell from 81% to 79% in later epochs. This indicates that the regularisation strategy enables the model to identify and converge upon stable, generalisable feature representations rather than persistently fitting noise patterns specific to the training set. The observed 4.02 percentage-point training-validation performance gap in the final model represents a theoretically expected and practically negligible degree of overfitting.
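The early stopping rule and the reported reduction in the train-validation gap can be reproduced with a short sketch. The validation scores below are hypothetical values chosen only to peak at epoch 4, and the function name is an assumption; the patience of ten epochs and the gap figures are taken from the text.

```python
def run_with_early_stopping(val_scores, patience: int = 10):
    """Return (stop_epoch, best_epoch) for a sequence of validation scores.

    Training stops once `patience` consecutive epochs have passed without
    the score exceeding the best value seen so far.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_scores), best_epoch


# Hypothetical validation macro F1 values: a peak at epoch 4, then a plateau.
val_macro_f1 = [0.61, 0.68, 0.71, 0.727, 0.724, 0.726, 0.722, 0.725,
                0.721, 0.723, 0.720, 0.722, 0.719, 0.721, 0.718, 0.720]
stop_epoch, best_epoch = run_with_early_stopping(val_macro_f1)

# Relative reduction of the train-validation gap reported in the text.
gap_reduction = (17.81 - 4.02) / 17.81
print(stop_epoch, best_epoch, round(gap_reduction * 100, 1))
```

With a best score at epoch 4 and a patience of ten, the rule stops at epoch 14, which matches the termination point described above. In practice the weights saved at the best epoch, not those at the stopping epoch, would normally be the ones retained.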

4. Results

4.1. Prediction Confidence Analysis

After training, the optimised model was deployed to classify 17,669 images. Figure 8 illustrates the confidence distribution of the model’s predictions, reflecting the overall stability and reliability of the multi-label cultural symbol recognition model. The results indicate that the high-confidence interval (>0.8) accounted for 78.3% (13,838 images), demonstrating robust recognition performance with strong consistency across most images. The medium-confidence interval (0.5–0.8) accounted for 15.6% (2753 images), indicating that some images still yielded ambiguous predictions due to factors such as weakened symbol features, overlapping symbols, or visual interference. The low-confidence range (<0.5) constituted only 6.1% (1078 images), a relatively small proportion, suggesting these images may suffer from issues such as blurring, occlusion, or atypical symbol labelling. Overall, the model’s prediction confidence exhibits a favourable concentration trend, with high-confidence samples holding an absolute majority. This establishes a reliable data foundation for subsequent analyses of cultural symbol distribution and research into enhancement strategies.
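The three confidence bands can be expressed as a simple binning rule. The sketch below is illustrative: it assumes one confidence value per image (for example, the highest label probability, which the text does not specify) and uses a handful of hypothetical values, then recomputes the reported shares from the image counts given above.

```python
def confidence_bins(confidences, low: float = 0.5, high: float = 0.8):
    """Count predictions in the high (>0.8), medium (0.5-0.8) and low (<0.5) bands."""
    counts = {"high": 0, "medium": 0, "low": 0}
    for c in confidences:
        if c > high:
            counts["high"] += 1
        elif c >= low:
            counts["medium"] += 1
        else:
            counts["low"] += 1
    return counts


# Hypothetical per-image confidence values.
example = confidence_bins([0.95, 0.85, 0.80, 0.60, 0.50, 0.30])

# Shares recomputed from the image counts reported in the text.
reported = {"high": 13838, "medium": 2753, "low": 1078}
total = sum(reported.values())
shares = {band: round(100 * n / total, 1) for band, n in reported.items()}
print(example, total, shares)
```

The recomputed shares agree with the percentages in the text, and the three counts sum to the 17,669 images classified.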

From a cultural symbol perspective, these results indicate that landmark architectural structures function as dominant identity symbols in visitor perception, whereas interpretive and decorative elements serve as secondary or marginal symbolic carriers.

4.2. Statistical Analysis and Indicator Assessment of Cultural Symbol Recognition

A quantitative analysis of 17,667 social media image tags reveals pronounced heterogeneity in the visual representation of the Yellow Crane Tower, as shown in Table 2 and Figure 9. The results exhibit a clear stratified distribution pattern, in which architectural elements dominate the visual field, while interpretive and culturally mediated elements remain significantly underrepresented. From a cultural symbol perspective, this distribution reflects a hierarchical symbolic structure within visitor-generated imagery. The main architectural structure emerges as the dominant identity symbol (32.95%), functioning as the primary visual anchor of heritage recognition. Park landscapes (17.92%) and other man-made structures (18.11%) constitute secondary spatial symbols that extend the perceptual field beyond the core monument. In contrast, interpretive elements such as wayfinding and signage (0.59%), cultural and creative products (4.88%), and landscape features (5.27%) are marginal symbolic carriers, indicating limited visibility in user-generated visual communication. Cultural inscription elements, including plaques (11.46%), poetic inscriptions (4.84%), and interior decorations (12.57%), occupy an intermediate symbolic layer. While they reflect the site’s literary and historical connotations, they remain subordinate to visually dominant architectural forms. This indicates that cultural meaning is acknowledged but not visually prioritised in photographic practices. The significant variation in standard deviations (0.071–0.434) suggests a strongly polarised distribution, where symbolic elements are either highly salient or entirely absent. The 19.7-fold disparity between the most and least frequent categories further confirms a severe imbalance in symbolic representation. 
Weighted frequency analysis shows that main architectural structures accumulate 6082 equivalent occurrences, whereas signage-related elements account for only 296 instances, reinforcing a 20-fold disparity in symbolic visibility. The probability distribution characteristics further validate model robustness, indicating that these patterns reflect authentic behavioural tendencies rather than methodological bias. Overall, these findings reveal that visitor visual attention constructs a structured cultural symbol system characterised by symbolic hierarchy: architectural forms constitute the primary visual identity symbols, while inscriptions, signage, and cultural products function as interpretive or contextual symbols with significantly lower visibility in user-generated visual representations.

The data reveals a dual-track differentiation in social media: one category comprises casual travel photography emphasising recognisable landmarks, while the other involves professional cultural documentation capturing specific cultural details.

4.3. Tourist Visual Attention and Cultural Expression Characteristics

The co-occurrence correlation matrix (Figure 10) further reveals the structural mechanisms underlying visitor-generated visual representation of the Yellow Crane Tower. The results demonstrate that cultural symbol production in social media imagery is not random but follows distinct compositional logics shaped by visual attention and symbolic selection. From a cultural symbol perspective, the observed patterns indicate a dual-structure symbolic system. The first is a ‘monumental symbolic mode’, characterised by the aggregation of architectural elements and the dominance of landmark structures. This is reflected in positive correlations between architectural structures and environmental elements such as urban skyline (r = 0.23) and park landscapes (r = 0.17), suggesting that visitors often construct a contextualised monument-centred symbolic frame. The second is a ‘contextual symbolic mode’, in which cultural meaning is partially extended through the inclusion of inscriptions and commemorative elements. The positive correlation between the main architectural structure and plaques (r = 0.16) indicates that some visitors actively integrate historical narrative symbols into their visual representation, reinforcing heritage meaning. However, significant negative correlations reveal strong compositional trade-offs in symbolic selection. The inverse relationship between main architectural structures and interior or auxiliary elements (r = −0.32, −0.27) indicates a substitution effect between monumental representation and cultural detail representation. This reflects a fragmented symbolic production process, where different layers of cultural meaning are rarely integrated within a single visual frame. Interpretive symbols such as signage and cultural products exhibit weak or near-zero correlations with all other categories (r = −0.01 to −0.16), confirming their peripheral position within the visual symbolic hierarchy. 
These elements are primarily functional or niche-oriented and are excluded from mainstream tourist imagery. Similarly, landscape elements show a contextual association with architectural forms but rarely function as independent symbolic subjects. Overall, these co-occurrence patterns reveal that visitor photography is governed by a selective symbolic construction mechanism in which heritage meaning is produced through the prioritisation of visually dominant architectural symbols and the partial exclusion of interpretive cultural layers. This results in a fragmented rather than fully integrated cultural symbol system in digital representations of heritage architecture.
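A label co-occurrence matrix of this kind can be obtained by correlating the label columns across images. The sketch below is a minimal illustration using the Pearson coefficient on binary labels (equivalent to the phi coefficient); whether the reported r values were computed from binary labels or from predicted probabilities is not stated, and the three columns and six images are hypothetical.

```python
import math


def pearson(x, y) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(labels):
    """`labels` holds one row per image and one 0/1 column per symbol category."""
    columns = list(zip(*labels))
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical columns: main structure, interior decoration, urban skyline.
labels = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
matrix = correlation_matrix(labels)
for row in matrix:
    print([round(r, 2) for r in row])
```

In this toy data the main structure and interior columns never appear together, giving a negative coefficient, while the skyline appears only alongside the main structure, giving a positive one. These are the same two signatures, substitution and contextual pairing, that the matrix in Figure 10 is read for.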

Synthesising these associative patterns from a cultural symbol perspective reveals a structured symbolic system underlying visitor-generated imagery. The observed visual trajectories correspond to different layers of symbolic expression: commemorative isolation reflects the dominance of landmark identity symbols, environmentally contextualised documentation represents spatially extended symbolic integration, and culturally detail-oriented photography reflects interpretive and narrative symbol engagement. Overall, the results demonstrate a hierarchical and fragmented cultural symbol system within social media representations of heritage architecture, where symbolic visibility is unevenly distributed across architectural, environmental, and interpretive dimensions.

5. Discussion and Conclusions

5.1. Strategies for Enhancing Cultural Symbols

Drawing upon the foregoing findings regarding cultural symbol recognition and visual attention patterns, the study reveals an underlying hierarchical cultural symbol system within visitor-generated representations of the Yellow Crane Tower. The results demonstrate that architectural elements function as dominant identity symbols, while interpretive, environmental, and cultural–creative elements occupy secondary or marginal positions within the visual symbolic structure. Based on this symbolic interpretation, targeted strategies for enhancing cultural symbolism can be formulated in response to the observed imaging imbalances, with the aim of enriching the multidimensional representation of heritage meaning in social media environments.

First, from the perspective of spatial organisation, diversified visitor itineraries should be introduced to mitigate congestion and to rebalance the spatial distribution of symbolic attention. The implementation of multi-path circulation systems, thematic exploration routes, and dispersed viewing nodes can reduce over-concentration on dominant architectural symbols, thereby promoting more comprehensive engagement with distributed cultural symbols across the site.

Second, in light of the visual dominance of primary architectural symbols and the marginalisation of interpretive cultural symbols, the visibility of secondary symbolic elements should be enhanced through optimised display and interpretive systems. Measures such as high-visibility wayfinding signage, narrative panels for inscriptions, and enhanced night-time illumination can strengthen the communicative presence of underrepresented cultural symbols.

Third, to address ‘commemorative isolation’ in visual representation, stronger contextual symbolic integration should be embedded within spatial and visual axes. By linking natural scenery, architectural ornamentation, and textual inscriptions into coherent visual narratives, visitors can be encouraged to construct more integrated symbolic representations of heritage space.

Fourth, recognising that culturally detail-oriented photography is mainly produced by a limited group of highly engaged users, interactive digital media such as AR overlays, cultural exploration markers, and thematic photographic routes can be introduced to broaden public participation in cultural symbol interpretation and reproduction.

Fifth, AI-driven analytical insights can inform the redesign and enhancement of underrepresented cultural symbols, including cultural products, interior furnishings, and landscape features, improving their aesthetic salience and narrative expressiveness within the overall symbolic system.

Collectively, these strategies aim to address the structural imbalance of cultural symbol representation revealed in this study. From a theoretical perspective, the findings contribute to understanding how cultural symbols in heritage architecture are selectively constructed, hierarchically organised, and visually reproduced within digital social media environments. This extends cultural symbol theory into the domain of data-driven visitor perception analysis and provides empirical support for the optimisation of heritage communication and symbolic representation.

5.2. Research Shortcomings and Future Directions for Improvement

Overall, while this study introduces deep learning-based multi-label recognition into cultural heritage architecture and quantitatively examines visitor preferences in cultural symbol representation, several limitations remain.

First, the research focuses on a single case, the Yellow Crane Tower. Although this landmark is highly recognisable and symbolically dense, other heritage types (e.g., industrial sites, vernacular landscapes, or lesser-known monuments) may exhibit different symbolic structures and dissemination patterns. The generalisability of the findings thus requires validation through comparative studies across diverse heritage contexts.

Second, the dataset is derived exclusively from user-generated content on Ctrip Travel, a single tourism platform. Despite its scale, potential biases related to user demographics, posting motivations, and platform-specific communication norms may persist. Moreover, data collection was confined to a platform-imposed temporal window, and peak versus off-peak visitation conditions could not be reliably distinguished. Since high crowd density may constrain in-depth visual engagement, this limitation may introduce bias into the observed representation of cultural symbols. Future work should incorporate multi-platform, multi-temporal datasets and official visitor flow statistics to enhance representativeness and enable more nuanced comparative analysis.

Third, online image data lack critical contextual metadata, including shooting distance, viewing angle, and precise spatial location, which restricts fine-grained spatial analysis of visitor perception. Additionally, without direct interaction with visitors, individual cultural backgrounds, prior knowledge, and interpretive intentions remain inaccessible. The analysis is thus confined to observable visual outputs rather than underlying cognitive drivers. Integrating social media data with surveys or user profiling could address this gap.

Fourth, constrained by annotation costs and classification complexity, the cultural symbol typology adopted remains relatively coarse. The capacity to capture deeper semantic layers, such as narrative meaning, emotional valence, or symbolic interpretation, is limited. Recognition difficulties persist for fine-grained categories, low-frequency symbols, and visually ambiguous or occluded scenes.

Finally, this study emphasises visual perception and symbolic representation rather than the material condition or temporal evolution of the heritage site itself. As noted by a reviewer, image-based digital technologies also hold promise for monitoring conservation status and architectural or landscape change over time, offering a valuable avenue for integrating perception-oriented research with heritage management.

Future research may thus pursue cross-heritage comparative analysis, multimodal data fusion, fine-grained symbol ontology construction, incorporation of visitor background data, and the development of explainable deep learning models. Additionally, integrating phenomenological or embodied cognition frameworks—such as gesture theory—through on-site behavioural observation or video analysis could help address the current limitation of static imagery in capturing the lived, kinaesthetic dimensions of heritage perception. Collectively, these directions would strengthen both the technical and theoretical foundations for understanding cultural heritage perception and inform more robust strategies for heritage communication and conservation.

6. Conclusions

This study employs a deep learning-based multi-label recognition methodology to systematically reveal the structural presentation of cultural symbols associated with the Yellow Crane Tower within social media imagery. The developed framework demonstrates robust performance in automatically identifying and categorising culturally salient visual elements from large-scale user-generated content, achieving reliable multi-label classification across diverse symbolic categories despite challenges posed by fine-grained distinctions and visually complex scenes. Its capacity to efficiently process extensive image datasets and extract statistically meaningful co-occurrence patterns constitutes a key strength of this research. The findings indicate that the visual expression of the heritage site exhibits a pronounced hierarchical structure centred on the main architectural edifice, with cultural details and contextual elements comparatively marginalised. Concurrently, co-occurrence relationships among labels reveal three typical pathways in visitor image production: commemorative isolated presentation, contextual environmental documentation, and detail-oriented cultural photography. These pathways reflect divergent modes of participation in heritage experiences and underlying cultural preferences. Nonetheless, several limitations persist, including reliance on a single data source, the inability to directly capture visitor motivations and emotional connections, and the potential for subjective bias inherent in expert annotation. Future research may incorporate more diverse platforms and demographic samples, expand the semantic depth of cultural symbol taxonomies, and integrate behavioural experiments with physiological measurements to deepen mechanistic understanding of visitor cultural experience. 
Moreover, data annotation should involve a broader range of cross-disciplinary experts and lay visitors, so that the resulting labels reflect a more widely shared understanding of cultural cognition. Overall, this study provides an empirical foundation for understanding the presentation of cultural symbols within social media environments and offers methodological and practical value for optimising cultural communication strategies and developing symbol enhancement interventions at heritage sites.

Author Contributions

Conceptualisation, L.L.; methodology, Z.L.; software, L.L. and Y.W.; validation, L.L. and C.Z.; formal analysis, C.Z.; data curation, C.Z.; writing—original draft, C.Z.; writing—review and editing, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Scholarship Program (No. 202508420110).

Data Availability Statement

The original contributions presented in this study are included in this article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Research location.

Figure 2. Research framework.

Figure 3. ResNet-50 detailed architecture.

Figure 4. ResNet-50 detailed architecture.

Figure 5. Labelling interface for dataset annotation.

Figure 6. Training curves of the micro F1 and macro F1 scores.

Figure 7. Labelling interface for dataset annotation.

Figure 8. Prediction confidence map.

Figure 9. Probability distributions by tag category.

Figure 10. Co-occurrence matrix of labels.

Table 1. Classification of cultural symbols.

Category	Cultural symbol
Architectural cultural symbols	Main architectural structure
Architectural cultural symbols	Roof and eaves
Architectural cultural symbols	Secondary buildings
Architectural cultural symbols	Interior decoration and furnishings
Historical and cultural symbols	Poetic inscriptions
Historical and cultural symbols	Plaque
Historical and cultural symbols	Landscape features
Natural environment symbols	The Yangtze River and water scenery
Natural environment symbols	City skyline
Natural environment symbols	Landscaping within the park
Cultural and creative symbols	Cultural and creative products
Cultural and creative symbols	Wayfinding and signage

Table 2. Relevant indicator results (rounded to two decimal places).

Classification Tags	Average Probability	Average Probability	Weighted Counting	Minimum Probability	Maximum Probability
Main architectural structure	0.34	0.43	6082.30	0.00	1.00
Landscaping within the park	0.20	0.30	3593.53	0.00	0.99
Secondary buildings	0.20	0.32	3465.58	0.00	1.00
Plaque	0.17	0.26	3003.49	0.00	1.00
Interior decoration and furnishings	0.12	0.28	2170.62	0.00	1.00
City skyline	0.12	0.25	2033.02	0.00	0.99
Roof and eaves	0.10	0.24	1743.53	0.00	1.00
Poetic inscriptions	0.07	0.20	1319.72	0.00	1.00
Landscape features	0.07	0.19	1311.10	0.00	1.00
Cultural and creative products	0.05	0.20	907.36	0.00	1.00
The Yangtze River and water scenery	0.04	0.14	759.97	0.00	0.98
Wayfinding and signage	0.02	0.07	295.52	0.00	1.00
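
Per-label summary columns like those in Table 2 could be derived from a matrix of predicted probabilities along the following lines. This is a hedged sketch rather than the authors' procedure: it assumes that 'Weighted Counting' denotes the sum of predicted probabilities for each label across all images, which the table itself does not confirm. The function name, label names, and sample values are hypothetical.

```python
import numpy as np


def summarise_probabilities(probs, labels):
    """Summarise per-label predicted probabilities.

    probs: array of shape (n_images, n_labels) with values in [0, 1].
    labels: list of n_labels label names.
    Returns a list of dicts sorted by descending weighted count.
    """
    probs = np.asarray(probs, dtype=float)
    rows = []
    for j, name in enumerate(labels):
        col = probs[:, j]
        rows.append({
            "label": name,
            "mean": float(col.mean()),
            # Assumption: the weighted count is the sum of probabilities.
            "weighted_count": float(col.sum()),
            "min": float(col.min()),
            "max": float(col.max()),
        })
    rows.sort(key=lambda r: r["weighted_count"], reverse=True)
    return rows


# Hypothetical predictions for 4 images and 2 labels.
probs = np.array([
    [0.9, 0.1],
    [0.8, 0.0],
    [0.1, 0.7],
    [0.2, 0.0],
])
summary = summarise_probabilities(probs, ["main_structure", "plaque"])

for row in summary:
    print(f"{row['label']}: mean={row['mean']:.2f}, "
          f"weighted={row['weighted_count']:.2f}, "
          f"min={row['min']:.2f}, max={row['max']:.2f}")
```

Under this assumption the weighted count equals the mean probability multiplied by the number of images, so labels rank identically on both columns.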
