1. Introduction
As humanity’s fundamental living carrier, vernacular dwellings embody rich historical and cultural information; they are material witnesses that reflect social change, technological evolution, and shifting aesthetics [1,2]. Over millennia, traditional dwellings have developed an integrated system that combines technical norms with artistic creation. Their construction techniques, spatial forms, and decorative features crystallize the social structures, cultural concepts, and ecological wisdom of specific historical periods [3]. Such “living heritage” not only possesses the material value of the buildings themselves but also serves as a spiritual anchor for regional cultural identity. Yet, amid rapid urbanization, the spread of modern building technologies has marginalized traditional crafting skills, triggering severe problems such as architectural homogenization and the rupture of craft transmission [4,5,6]. In particular, the dating of traditional dwellings, the foundational step in heritage protection and research, directly affects the interpretation of historical information, the selection of restoration techniques, and the formulation of planning decisions.
Determining the construction period of traditional vernacular dwellings is valuable for historical research, craft transmission, and planning management, yet it remains constrained by severe methodological limitations.
Historical research and cultural heritage: Accurate dating helps reconstruct the social structure, household forms, and lifestyles of specific periods.
Architectural heritage protection and technology: In cultural heritage conservation, the period label can be matched to the techniques and materials of that period. For example, the label can call up a typical material library of the corresponding period (such as late-Qing blue bricks or Republican cement mosaics) to provide a basis for restoration material selection and prevent chronological dissonance.
Urban and rural planning: Understanding the architectural features of different periods avoids conflicts between new construction and the historical environment. For example, overlaying the period layer onto the national spatial-planning “one map” automatically generates a three-tier zoning of “historical-character core protection zone, coordination zone, modern zone”; governments can then prioritize the repair of high-risk, low-scarcity precincts and postpone intervention in high-density, well-preserved areas, thus reducing wholesale demolition and reconstruction. The list of scarce-era buildings output by the model can also be directly incorporated into the traditional dwelling restoration subsidy catalogue, increasing subsidy amounts.
However, existing identification methods exhibit significant shortcomings: documentary methods rely on scarce written records; typological analyses are constrained by the subjectivity of expert experience; and techniques such as ¹⁴C dating are costly and damage samples [7,8]. These limitations severely hinder large-scale vernacular-dwelling surveys and conservation efforts. In recent years, deep learning has demonstrated strong potential in architectural-feature recognition [9,10], and the international community has begun to explore its application in architectural heritage for tasks such as component extraction [11], defect diagnosis [12], style identification [13], and period dating [14], thus offering new research perspectives for dating traditional dwellings. Nevertheless, three deficiencies remain in studies on Chinese vernacular dwellings: (1) the absence of multi-scale analytical methods that integrate global morphology with local detail features; (2) insufficient consideration of the impact of imbalanced temporal-class distributions on model performance; and (3) the difficulty of directly applying existing international findings [11,12,13,14] to the characteristics of China’s timber-frame vernacular system. Developing an accurate and practical intelligent dating method for vernacular dwellings is therefore both a technological imperative in the digital humanities era and an urgent need for solving key issues such as “chronological dissonance” in conservation practice. A deep learning model for classifying and visually analyzing traditional vernacular dwellings also has important value for historical research and cultural transmission, cultural-relic protection and restoration, urban and rural planning and heritage management, architectural science and technological history, and cultural tourism and public education.
This study presents a novel deep learning-based model for chronological classification of traditional vernacular architecture. The key contributions are as follows:
Through a survey of traditional residential buildings in the Longzhong region of Gansu Province, we classified each dwelling by construction era and established an image dataset of traditional residential buildings from different eras in the region;
By comparing Accuracy, Precision, Recall, and F1-score across three models (EfficientNet, ResNet50, and Vision Transformer), we selected EfficientNet and ResNet50 as the models better suited to our classification task;
To verify the advantage of our proposed “Local-Global Feature Joint Learning” architecture in classification tasks, we evaluated the enhanced DualBranchEfficientNet and DualBranchResNet models;
To address the sample imbalance problem, we introduced a mixed triplet loss into the DualBranchEfficientNet and DualBranchResNet models and conducted comparative ablation experiments;
To verify the performance of the DualBranchEfficientNet model with mixed triplet loss, we evaluated its metrics per era and compared its confusion matrix with that of the cross-entropy model;
Using Grad-CAM heat maps, we analyzed the features that the DualBranchEfficientNet model with hybrid triplet loss extracts from traditional dwellings of different eras. The results indicate that the model’s focus aligns closely with that of traditional methods, supporting its credibility.
The structure for the remaining sections of the paper is outlined as follows:
Section 2 reviews prior work on traditional dating methodologies, machine learning approaches for architectural age prediction, multi-feature fusion techniques, and strategies for addressing class imbalance.
Section 3 details the construction of our dataset and presents the proposed classification model—DualBranchEfficientNet integrated with a hybrid triplet loss function.
Section 4 describes the experimental setup, reports quantitative results, and provides a comparative analysis. Finally, Section 5 concludes the paper by summarizing the findings and the key insights obtained.
4. Results and Discussion
In this section, we present and analyze the experimental results of the proposed deep learning approach to the chronological classification of traditional dwellings in the Longzhong Loess Plateau region of Gansu. First, with the dwellings grouped by era, we applied five deep learning models, namely ResNet-50 [61], EfficientNet [62], Vision Transformer [63], DualBranchResNet, and DualBranchEfficientNet, to classify them. This allows us to compare the performance of the models on traditional dwellings from different eras in the Longzhong region of Gansu Province.
Secondly, to better address class imbalance and enhance the model’s ability to recognize and extract image features, we employed a hybrid triplet loss strategy to conduct ablation experiments on the model’s chronological classification task for architectural dating.
Lastly, to further verify the classification performance of the proposed model, we compared the DualBranchEfficientNet model with the hybrid triplet loss function against the traditional cross-entropy model in terms of classification metrics and confusion matrices across different eras.
4.1. Experimental Environment
To ensure the fairness and accuracy of the experiments, all input images were resized to 224 × 224 pixels, and the batch size was set to 64. All models were trained for 200 epochs. As shown in Table 5, the experiments ran on Ubuntu 20.04 with Python 3.8 and PyTorch 2.1.0. The graphics card is an NVIDIA RTX 4090D using CUDA 12.1, and the CPU is an Intel i9-14900K. The AdamW optimizer was used, with the initial learning rate set to 1 × 10⁻² and the final learning rate set to 0.01 times the maximum learning rate. The learning rate was reduced using a cosine annealing schedule.
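The learning-rate settings above can be illustrated with a minimal sketch of cosine annealing. It assumes the rate is updated once per epoch, with no warmup and no restarts; neither detail is stated in the text. The function name `cosine_lr` is hypothetical.

```python
import math

LR_MAX = 1e-2           # initial (maximum) learning rate from the text
LR_MIN = 0.01 * LR_MAX  # final learning rate: 0.01 times the maximum
EPOCHS = 200            # number of training epochs from the text


def cosine_lr(epoch: int, total: int = EPOCHS,
              lr_max: float = LR_MAX, lr_min: float = LR_MIN) -> float:
    """Learning rate at a given epoch under cosine annealing.

    Assumption: one update per epoch, no warmup, no restarts.
    """
    progress = epoch / total
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Full schedule from epoch 0 (lr_max) to epoch 200 (lr_min).
schedule = [cosine_lr(e) for e in range(EPOCHS + 1)]
```

In PyTorch, a comparable schedule could be obtained with `torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-4)` wrapped around an AdamW optimizer, although whether the authors used this exact scheduler is not stated.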
4.2. Model Evaluation Metrics
In order to accurately and objectively evaluate the model’s performance on the classification task, a confusion matrix was first established to visually present the classification results. Secondly, we adopted four metrics—Accuracy, Precision, Recall, and F1-score—as criteria for judging the quality of classification, with each metric summarized as follows:
Confusion Matrix: It is the basis for evaluating the performance of classification models, showing the relationship between the model’s predictions and the actual labels in tabular form. For binary classification tasks, the confusion matrix is a 2 × 2 matrix, as shown in Table 6, where each row represents the actual class and each column represents the predicted class.
TP (True Positive): Correctly predicted positive cases (correct identification);
FP (False Positive): Negative cases incorrectly predicted as positive (Type I Error, false alarm);
FN (False Negative): Positive cases incorrectly predicted as negative (Type II Error, missed detection);
TN (True Negative): Correctly predicted negative cases (correct rejection).
Accuracy: It is the most intuitive classification model evaluation metric, measuring the proportion of overall correct predictions and serving as a preliminary indicator of global model performance. Its calculation formula is expressed as follows:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision: It is a critical evaluation metric for classification models, quantifying the proportion of true positive predictions among all positive predictions (i.e., the accuracy of positive predictions). It specifically emphasizes minimizing false alarms (False Positives, FP) and is computed as follows:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall: Also known as sensitivity or true positive rate (TPR), it is a core performance metric for classification models. It quantifies the model’s ability to identify all actual positive instances, with its primary focus on minimizing missed detections (False Negatives, FN). It is computed as follows:
\[ \text{Recall} = \frac{TP}{TP + FN} \]
F1-score: It is one of the most critical composite metrics in classification model evaluation, defined as the harmonic mean of Precision and Recall. It is particularly suitable for imbalanced datasets or scenarios requiring simultaneous minimization of both false positives (FP) and false negatives (FN). It is computed as follows:
\[ \text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
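For a multi-class task such as the four-era classification here, the binary definitions above are commonly applied one-vs-rest per class and then averaged. The sketch below assumes macro averaging (an unweighted mean over classes); the text does not state which averaging was used. The helper names and the 3 × 3 example matrix are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[int]]  # rows = actual class, columns = predicted class


def per_class_counts(cm: Matrix, k: int) -> Dict[str, int]:
    """One-vs-rest TP/FP/FN/TN for class k."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                        # actual k, predicted otherwise
    fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, actually otherwise
    tn = total - tp - fn - fp
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


def safe_div(a: float, b: float) -> float:
    """Return a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def macro_metrics(cm: Matrix) -> Dict[str, float]:
    """Overall accuracy plus macro-averaged Precision, Recall, and F1."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        c = per_class_counts(cm, k)
        p = safe_div(c["TP"], c["TP"] + c["FP"])
        r = safe_div(c["TP"], c["TP"] + c["FN"])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))
    return {
        "accuracy": sum(cm[k][k] for k in range(n)) / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical 3-class confusion matrix, for illustration only.
example_cm = [[8, 2, 0],
              [1, 7, 2],
              [0, 1, 9]]
example_metrics = macro_metrics(example_cm)
```

Macro averaging gives each era equal weight regardless of its sample count, which is one reason it is often preferred when classes are imbalanced.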
4.3. Baseline Model Evaluation and Selection
We conducted comparative experiments using three representative deep learning classification models: ResNet50, EfficientNet-b4 (which has a comparable number of parameters to ResNet50 and is hereafter referred to as EfficientNet in this paper), and Vision Transformer—a Transformer-based architecture suitable for classification tasks.
The results are presented in Table 7. The EfficientNet model achieved Accuracy, Precision, Recall, and F1-scores of 85.1%, 81.6%, 81.0%, and 81.1%, respectively. The ResNet50 model performed slightly below EfficientNet across all metrics, with scores of 83.7%, 80.3%, 79.5%, and 79.9%. The Vision Transformer model yielded the lowest scores among the three models at 77.0%, 72.5%, 71.5%, and 72.0%, respectively.
The results above indicate that Vision Transformer (ViT) delivered the weakest classification performance. This is primarily attributed to its core Transformer encoder architecture, which relies on self-attention mechanisms, in contrast to the CNN-based frameworks of EfficientNet and ResNet50. In processing image input, EfficientNet and ResNet50 apply local convolutional operations, whereas ViT linearly embeds image patches and therefore lacks a local inductive bias. Although ViT has powerful global modeling capacity for image classification, it depends heavily on large-scale datasets, typically requiring over a million labeled samples to fully leverage its self-attention advantages. Conversely, EfficientNet and ResNet50 perform better on small-to-medium-sized datasets. In addition, our task involves classifying buildings from different eras, where the distinguishing features between adjacent periods are often subtle and manifest primarily in architectural details such as windows, roofs, eaves, materials, and structural elements, which local convolutions are well placed to capture. These factors collectively explain the performance differences observed in the table.
In summary, given the specific classification task characteristics and dataset constraints, EfficientNet and ResNet50 models are better suited for the traditional residential building classification task in the Longzhong region of Gansu Province.
4.4. Improved Model Evaluation
The architectural characteristics across different eras manifest not only in macro-feature differences such as building structures, roof typologies, and courtyard layouts, but more significantly in micro-level local feature distinctions including carved patterns on doors/windows, brick/tile detailing, decorative motifs, building materials, and door/window proportions. Therefore, to better extract era-specific architectural features and enhance model classification accuracy, as established in Section 4.3, we selected EfficientNet and ResNet50 as backbone networks. Upon these, we constructed a dual-branch network, namely, the “Local-Global Feature Joint Learning Network Architecture.” Under identical experimental conditions, we compared the backbone networks with the improved network, with the results presented in Table 8.
In Table 8, the DualBranchResNet model achieved an Accuracy of 87.6%, Precision of 85.0%, Recall of 84.5%, and F1-score of 84.8%, which are 3.9, 4.7, 5.0, and 4.9 percentage points higher than those of the ResNet50 model, respectively. The DualBranchEfficientNet-b4 model achieved an Accuracy of 89.6%, Precision of 87.7%, Recall of 86.0%, and F1-score of 86.7%, which are 4.5, 6.1, 5.0, and 5.7 percentage points higher than those of the EfficientNet-b4 model, respectively. Compared to the DualBranchResNet model, the DualBranchEfficientNet model achieved 2.0, 2.7, 1.6, and 1.9 percentage points higher Accuracy, Precision, Recall, and F1-score, respectively. This is because the improved dual-branch network adopts the “Local-Global Feature Joint Learning Network Architecture,” which, for traditional residential buildings of different eras, can extract not only macro-level differences in building structures, roof forms, and spatial layouts but also micro-level local features such as building materials, carved decorations, and door/window styles. As the table shows, our proposed DualBranchEfficientNet model achieved the best Accuracy, Precision, Recall, and F1-score among the compared models for classifying traditional residential buildings in the Longzhong region of Gansu Province across different eras.
4.5. Ablation Experiment
In the era classification task for traditional residential buildings in the Longzhong region of Gansu Province, the collected data from different periods exhibited a class imbalance problem, as shown in Table 3 and Figure 3. This imbalance arose primarily because our data collection was motivated by cultural heritage preservation; thus, during collection, greater emphasis was placed on traditional residential buildings from the pre-1911 era and the 1912–1949 period. For post-1949 buildings, due to social transformations and economic development, representative traditional residential structures are relatively scarce. Consequently, the 1950–1980 and post-1981 periods are under-represented relative to the pre-1911 and 1912–1949 periods. To address this, we employed a mixed triplet loss strategy to mitigate the class imbalance issue.
First, to determine the triplet loss weight adjustment coefficient ε, we experimented with different ε values; the classification results corresponding to each ε are presented in Table 9.
Based on the pattern in these results, focal loss contributes the most within the hybrid loss, while triplet loss also provides a noticeable contribution; therefore, we set ε to 0.3.
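To make the role of ε concrete, the following is a minimal sketch of a focal-plus-triplet hybrid loss for a single sample. The additive form (focal loss plus ε times triplet loss), the focusing parameter γ = 2, the margin of 1.0, and the Euclidean distance are all assumptions made for illustration; the exact formulation used in this paper is defined in Section 3 and may differ. All function names are hypothetical.

```python
import math
from typing import Sequence

EPSILON = 0.3  # triplet-loss weight reported in the text


def focal_loss(probs: Sequence[float], target: int, gamma: float = 2.0) -> float:
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma = 2.0 is an assumed default, not a value taken from the paper.
    """
    p_t = max(probs[target], 1e-12)  # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 1.0) -> float:
    """Pull same-era embeddings together and push different-era ones apart."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def hybrid_loss(probs: Sequence[float], target: int,
                anchor: Sequence[float], positive: Sequence[float],
                negative: Sequence[float], eps: float = EPSILON) -> float:
    """Assumed combination: focal loss + eps * triplet loss."""
    return focal_loss(probs, target) + eps * triplet_loss(anchor, positive, negative)
```

With ε = 0.3 under this assumed form, the classification term dominates while the triplet term still shapes the embedding space, which matches the qualitative description above.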
To validate our proposed method for addressing class imbalance, we conducted ablation studies on the DualBranchResNet and DualBranchEfficientNet models, with the results shown in Table 10.
As shown in Table 10, compared to the traditional cross-entropy (ce) model, the DualBranchResNet and DualBranchEfficientNet models with mixed triplet loss achieved consistent improvements in Accuracy and F1-score. The DualBranchResNet model with mixed triplet loss attained Accuracy, Precision, Recall, and F1-scores of 88.8%, 87.0%, 85.5%, and 86.2%, respectively, representing increases of 1.2, 2.0, 1.0, and 1.4 percentage points over the traditional cross-entropy (ce) model across these metrics. The DualBranchEfficientNet model achieved metrics of 90.5%, 88.9%, 87.6%, and 88.2%, outperforming the traditional cross-entropy (ce) model by 0.9, 1.2, 1.6, and 1.5 percentage points, respectively. When the DualBranchResNet and DualBranchEfficientNet models both incorporate mixed triplet loss, all four metrics of the DualBranchEfficientNet model are higher than those of the DualBranchResNet model. Consequently, the DualBranchEfficientNet model with mixed triplet loss is better suited for our classification task.
4.6. DualBranchEfficientNet Model Per-Class and Confusion Matrix Comparison
4.6.1. Per-Class Comparison
To validate the classification performance of the improved model across different eras, as shown in Table 11, we conducted a comparative analysis of classification metrics between the traditional cross-entropy model and the model incorporating mixed triplet loss.
As shown in Table 11, for the pre-1911 era, both methods achieved exceptionally high performance, indicating that with sufficient data volume, models can stably learn features. However, the model with mixed triplet loss exhibited a slight decrease of 1.3 percentage points in Precision, potentially due to diminished marginal optimization effects of triplet loss on abundant samples.
For the 1912–1949 era, the F1-score increased from 92.5% to 93.6% with minor performance fluctuations, suggesting stable classification for this period.
In the 1950–1980 era (with the fewest samples), Recall improved from 72.7% to 80.0% with the mixed triplet loss model, indicating that triplet loss enhances discriminability for minority classes through feature contrast when samples are scarce. Precision, however, decreased by 1.8 percentage points (83.3%→81.5%), reflecting a trade-off between Recall and Precision that the F1-score summarizes. The F1-score of the mixed triplet loss model was 4.1 percentage points higher than that of the traditional cross-entropy model (77.6%→81.7%).
For the post-1981 era, the mixed triplet loss model outperformed the cross-entropy model with a Precision gain of 4.5 percentage points (82.8%→87.3%) and an F1-score improvement of 2.1 percentage points (81.4%→83.5%), indicating effective suppression of misclassification for modern buildings.
Overall, the mixed triplet loss model outperforms the cross-entropy model, most clearly on the two minority eras.
4.6.2. Confusion Matrix
To better understand the classification performance of the improved model across different eras, we generated confusion matrices for both the traditional cross-entropy model and the model incorporating mixed triplet loss, as shown in Figure 5, further validating the improvement in handling class imbalance. For the pre-1911 and post-1981 eras, the number of correctly classified images remained unchanged at 127 and 48 images, respectively. For the 1912–1949 era, correctly classified images decreased from 104 to 103, a reduction of one image that does not reflect a meaningful change in model performance. Conversely, for the smallest sample era (1950–1980), correctly classified images increased from 40 to 44, demonstrating that the mixed triplet loss model moderately improves performance under class imbalance.
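The link between these counts and the Recall values reported in Section 4.6.1 for the 1950–1980 era can be checked with a short calculation. The class size of 55 test images used below is not stated in the text; it is inferred here as the value consistent with both reported Recall figures and should be read as an assumption.

```latex
\[
\text{Recall}_{\text{cross-entropy}} = \frac{40}{55} \approx 72.7\%,
\qquad
\text{Recall}_{\text{mixed triplet}} = \frac{44}{55} = 80.0\%
\]
```

Under that assumption, the four additional correctly classified images account for the full Recall gain of about 7.3 percentage points.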
As shown in Figure 5b, the highest misclassification rate occurs between 1912–1949 dwellings and pre-1911 dwellings. This stems from several factors. First, early Republican-period houses inherited almost all Qing-era practices in floor plans, structures, roof pitches, and timber systems, resulting in minimal macroscopic differences. Only subtle distinctions, such as window-frame patterns, brick-carving motifs, column diameters, bracket-set proportions, or ridge-beast counts, separate the two periods, and these require high-resolution images to be discerned; the model, operating on 224 × 224 inputs, often overlooks such fine cues, leading to errors. Second, although the two periods appear numerically balanced, most surviving pre-1911 buildings were erected near the turn of the century, and many underwent partial renovations after 1912 (e.g., replacing window sashes, adding canopies), creating “temporally mixed” samples whose outward features closely resemble those of 1912–1949, further aggravating misclassification.
For 1950–1980 versus post-1981 dwellings, the hybrid triplet loss model raises accuracy, yet a residual error remains. The 1950–1980 period already produced simplified quadrangles and brick-concrete bungalows, and after 1981 numerous houses retained the same layout. Moreover, cement, red brick, and machine-made tiles introduced late in the 1950–1980 period continued to be used after 1981, merely in brighter colors, causing texture and color overlap in the extracted features. Renovations carried out after 1981, such as tile cladding and aluminum window replacements, make 1950–1980 buildings visually closer to post-1981 ones, sustaining the error rate.
Conversely, dwellings from pre-1911 and 1912–1949 differ markedly from the 1950–1980 and post-1981 cohorts in floor plans, materials, and overall morphology; consequently, the model extracts distinctive features and achieves low misclassification.
In conclusion, through comparative analysis of Accuracy, Precision, Recall, F1-score, and confusion matrices across different models, the DualBranchEfficientNet model incorporating mixed triplet loss outperforms other models in classifying traditional residential buildings from different eras.
4.7. Grad-CAM Analysis
As stated above, although the model performs well in classifying the periods of vernacular buildings, the inherent opacity of deep learning means that it offers limited transparency about what it learns from samples. Heat maps generated via Grad-CAM reveal which architectural features the model extracts and learns, which helps in assessing its robustness and reliability and bolsters user trust in its decisions.
Figure 6 presents the Grad-CAM heat maps generated by the DualBranchEfficientNet model with hybrid triplet loss for traditional dwellings of different periods. The heat maps use a red-yellow-blue scale to indicate response strength from strong to weak. Across all periods, the model’s attention is mainly concentrated on the eaves, walls, and windows, which consistently appear as red high-response regions, suggesting that the model first relies on overall contours to distinguish chronological levels. (1) For dwellings built before 1911, the local-branch heat maps focus on the eaves, with orange-yellow high responses on dougong (bracket sets) and queti (sparrow braces), indicating that the model captures the intricate wood carvings characteristic of the late Qing period. (2) In the 1912–1949 period, responses concentrate on the eaves and column diameters; due to social and economic factors, dwellings of the Republican era simplified dougong and omitted queti. (3) During 1950–1980, the local-branch heat maps attend not only to the eaves, but also to the windows and walls. In this period, the junction between the wall and the roof retains the chengliang fang (purlin plate), yet the rest of the wall shifts from timber to brick-and-earth, the window area decreases, the columns disappear, and the once-deep eaves vanish. (4) After 1981, the red highlights move to the brick-wall exterior and glass windows.
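For readers unfamiliar with how such heat maps are formed, the sketch below shows the core Grad-CAM combination step in plain Python. It assumes the feature maps of a chosen convolutional layer and the gradients of the class score with respect to them have already been extracted (for example, with framework hooks); which layer and branch were used for Figure 6 is not specified here, and the function name `grad_cam` is hypothetical.

```python
from typing import List

Map = List[List[float]]  # one H x W feature map


def grad_cam(activations: List[Map], gradients: List[Map]) -> Map:
    """Combine K feature maps into a normalized Grad-CAM heat map.

    activations[k][i][j]: value of feature map k at position (i, j).
    gradients[k][i][j]:   gradient of the class score w.r.t. that value.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight alpha_k: spatial mean of the gradients of map k.
    alphas = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum over channels, followed by ReLU to keep positive evidence.
    cam = [[max(0.0, sum(a * act[i][j] for a, act in zip(alphas, activations)))
            for j in range(width)]
           for i in range(height)]
    # Scale to [0, 1] so the map can be rendered on a color scale.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Tiny hypothetical example: two 2 x 2 feature maps.
demo_activations = [[[1.0, 0.0], [0.0, 0.0]],
                    [[0.0, 0.0], [0.0, 2.0]]]
demo_gradients = [[[1.0, 1.0], [1.0, 1.0]],
                  [[-1.0, -1.0], [-1.0, -1.0]]]
demo_cam = grad_cam(demo_activations, demo_gradients)
```

In a rendering like the one described above, values near 1 would map to red and values near 0 to blue; the resulting low-resolution map is typically upsampled to the input size before being overlaid on the photograph.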
Although the model demonstrates strong feature extraction and classification capabilities, misclassification still occurs. To intuitively reveal the underlying causes of these errors, we visualized typical misclassified samples using Grad-CAM.
As shown in Figure 7a, the presence of colorful decorative bands under the eaves led the model to mistakenly assign a 1912–1949 building to the pre-1911 period. In Figure 7b, post-renovation tiling over the original timber façade caused the model to misclassify a pre-1911 building as post-1981. Figure 7c illustrates that a neighboring post-1981 brick wall on the right side of the image biased the model, resulting in a 1950–1980 building being labeled as post-1981. Figure 7d shows that the long shooting distance misled the model into classifying a post-1981 building as belonging to the 1950–1980 period.
The foregoing shows that, although deep learning models perform well on architectural classification and recognition tasks, they still exhibit clear limitations relative to humans: (1) strong data dependence—requiring large training sets and being sensitive to data distribution, with marked performance drops when confronted with unseen building types; (2) limited spatial understanding—sensitive to scale changes, so that the same building photographed from different distances can yield different classifications; (3) poor adaptability—variations in lighting, weather, or viewing angle significantly affect performance; (4) difficulty in fine-grained discrimination—confusing buildings with similar styles or local features (e.g., Baroque vs. Rococo) and over-attending to certain local cues, leading to misclassification of the entire style; (5) interpretability deficits—unable to provide transparent rationales for learning and decisions; even with Grad-CAM and similar tools, heat maps may highlight irrelevant features yet still produce correct labels; (6) inability to distinguish originals from imitations—unable, for instance, to differentiate authentic ancient buildings from modern replicas. Therefore, AI systems should serve as “auxiliary tools,” and any restoration or planning decision must ultimately rely on human verification.
5. Conclusions
To address challenges including feature extraction difficulties in era classification of traditional architecture, cumbersome traditional methods, and erratic classification outcomes, we propose a DualBranchEfficientNet model incorporating mixed triplet loss for era classification of traditional residential buildings in the Longzhong region of Gansu Province.
First, due to the scarcity of datasets in the study area, we constructed a dataset of traditional buildings from different eras in the Longzhong region of Gansu Province. This dataset comprises 1181 photos of traditional residential buildings across eras: 427 from the pre-1911 period, 373 from 1912–1949, 181 from 1950–1980, and 200 from the post-1981 period to the present.
Second, we selected three representative deep learning classification models (EfficientNet, ResNet50, and Vision Transformer) for comparative experiments. Experimental results show that the EfficientNet model achieved Accuracy, Precision, Recall, and F1-scores of 85.1%, 81.6%, 81.0%, and 81.1%, respectively, outperforming the ResNet50 model by 1.4, 1.3, 1.5, and 1.2 percentage points, and surpassing the Vision Transformer model by 8.1, 9.1, 9.5, and 9.1 percentage points across these metrics. Through comparative evaluation, EfficientNet and ResNet50 are better suited for our classification task than Vision Transformer.
Third, to enhance feature extraction capability and classification accuracy for traditional residential buildings across eras, we propose a “Local-Global Feature Joint Learning Network Architecture” based on the selected EfficientNet and ResNet50 models, namely, the DualBranchEfficientNet and DualBranchResNet models. Comparative results against the EfficientNet and ResNet50 models show that the DualBranchResNet model achieved Accuracy of 87.6%, Precision of 85.0%, Recall of 84.5%, and F1-score of 84.8%, outperforming ResNet50 by 3.9%, 4.7%, 5.0%, and 4.9%, respectively; the DualBranchEfficientNet model attained Accuracy of 89.6%, Precision of 87.7%, Recall of 86.0%, and F1-score of 86.7%, surpassing EfficientNet by 4.5%, 6.1%, 5.0%, and 5.7%, respectively. Thus, our proposed DualBranchEfficientNet model delivers optimal performance across Accuracy, Precision, Recall, and F1-scores for classifying traditional residential buildings in the Longzhong region of Gansu Province across different eras.
Fourth, to address the sample imbalance problem, we improved the DualBranchEfficientNet and DualBranchResNet models by introducing mixed triplet loss and conducted comparative ablation experiments. Compared with the traditional cross-entropy model, the DualBranchResNet and DualBranchEfficientNet models incorporating mixed triplet loss improved Accuracy by 1.2 and 0.9 percentage points and F1-score by 1.4 and 1.5 percentage points, respectively. With both models incorporating mixed triplet loss, the DualBranchEfficientNet model attained Accuracy, Precision, Recall, and F1-scores of 90.5%, 88.9%, 87.6%, and 88.2%, respectively, outperforming the DualBranchResNet model by 1.7, 1.9, 2.1, and 2.0 percentage points across these metrics. We conclude that our proposed DualBranchEfficientNet model incorporating mixed triplet loss is better suited for our classification task.
Fifth, to better understand our proposed model, we compared the per-category classification metrics and confusion matrices of the DualBranchEfficientNet model with and without mixed triplet loss. For the 1950–1980 era, which has the fewest samples, Recall improved from 72.7% to 80.0% with mixed triplet loss, and correctly classified images increased from 40 to 44. This indicates that incorporating mixed triplet loss yields a moderate improvement in performance under sample imbalance.
Finally, by analyzing the heat maps produced by the DualBranchEfficientNet model with the hybrid triplet loss function, we found that the features the model attends to in traditional dwellings from different historical periods are consistent with those relied on by conventional methods, which supports the model’s reliability. Moreover, examination of the heat maps for misclassified instances showed that errors were mainly associated with later renovations, neighboring buildings in the frame, and long shooting distances.
In this study, the proposed DualBranchEfficientNet model incorporating mixed triplet loss proved effective for feature extraction and classification of traditional residential buildings across eras in the Longzhong region of Gansu Province. It can serve as a new method for building era classification and identification and, when combined with traditional methods, enables more accurate era recognition. It may also provide a reference for rural revitalization and rural landscape character control. Although this study demonstrates strong results in the Longzhong region of Gansu Province, the model’s generalizability to other regions has not yet been verified due to a lack of external data; we plan cross-regional collaborations to expand validation and assess its universality.
In the future, we will further investigate multi-scale fusion strategies that integrate deep learning with multi-source data—such as hyperspectral imagery, LiDAR point clouds, and historical archives—by incorporating attention mechanisms and spatio-temporal feature coupling to enhance the accuracy and robustness of building-age classification. Concurrently, we plan to construct a standardized, cross-regional test set to evaluate the model’s transferability and generalizability across buildings from diverse cultural contexts, ultimately establishing an intelligent classification framework adaptable to varied heritage scenarios and providing a scientific foundation for the digital preservation and monitoring of architectural heritage.