Large AI Models for Building Material Counting Task: A Comparative Study
Abstract
1. Introduction
2. A Survey of Large AI Models
2.1. Multimodal Large Models
2.2. Purely Visual Large Models
2.3. Large Model Platforms
3. Performance Evaluation
3.1. Multimodal Large Models
3.2. Purely Visual Large Models
4. Secondary Model Developed Based on EasyDL
4.1. Brief Introduction of EasyDL
4.2. Development Workflow of Secondary Model Based on EasyDL
4.3. Performance Comparison for Rebar Detection
4.4. Performance Comparison for Other Building Materials
4.5. Discussion
4.5.1. Advantages and Practical Significance of the Research
4.5.2. Limitations and Future Directions
- Data Augmentation and Quality Improvement: Increase sample diversity to cover more types of building materials and ensure a sufficient sample size for each category. Data preprocessing, including stronger data augmentation, could also be improved to enhance the model’s robustness under diverse site conditions (a minimal augmentation sketch follows this list).
- Model Architecture Innovation: Explore new network architectures better suited for the task of counting building materials. For instance, incorporating attention mechanisms and other advanced feature extraction techniques could improve the accuracy of key region identification.
- Automatic Online Updates: Utilize new data collected from the deployed models to continuously update and refine the existing model, thereby gradually improving its prediction accuracy.
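The augmentation direction above can be prototyped offline before data are uploaded to a training platform. The sketch below uses the albumentations library, which is not part of the paper’s EasyDL workflow; the transform choices (loosely mirroring EasyDL’s “Color” and “Posterize” options), parameter values, file names, and box coordinates are illustrative assumptions only.

```python
# Illustrative only: an offline, bounding-box-aware augmentation pipeline using
# the albumentations library (not used in the paper, which relied on EasyDL's
# built-in options). Parameter values and the Pascal-VOC box format are assumptions.
import albumentations as A
import cv2

augment = A.Compose(
    [
        A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.3, hue=0.05, p=0.7),
        A.Posterize(num_bits=4, p=0.3),          # loosely mirrors the "Posterize" option
        A.RandomBrightnessContrast(p=0.5),
        A.HorizontalFlip(p=0.5),
        A.GaussNoise(p=0.2),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = cv2.cvtColor(cv2.imread("rebar_sample.jpg"), cv2.COLOR_BGR2RGB)
boxes = [[120, 80, 180, 140]]   # hypothetical rebar end-face box (x_min, y_min, x_max, y_max)
labels = ["rebar"]

out = augment(image=image, bboxes=boxes, labels=labels)
aug_image, aug_boxes = out["image"], out["bboxes"]  # feed these into the detector's training set
```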
4.5.3. Ethical and Operational Risk Reminders
- Data leakage risks: Original images uploaded to cloud platforms may contain project-sensitive information. QR codes, project nameplates, background buildings, and similar details should be blurred or cropped out of the images before uploading (a minimal masking sketch follows this list).
- Over-reliance risks: Large models still produce missed detections and misjudgments even for regular materials such as steel bars and wooden beams, so model outputs must be spot-checked and reviewed by on-site material staff. In critical workflows such as settlement and payment in particular, a closed-loop process of “model initial counting, manual verification, difference tracing” should be established to ensure quantity accuracy.
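A minimal masking sketch in line with the data-leakage reminder above, using OpenCV; the file names and region coordinates are placeholders, and in practice the regions could come from cv2.QRCodeDetector or manual annotation rather than hard-coded values.

```python
# Illustrative only: masking project-sensitive regions (QR codes, nameplates,
# background buildings) with OpenCV before uploading an image to a cloud model.
# The region coordinates below are placeholder values.
import cv2

image = cv2.imread("site_photo.jpg")

# (x, y, width, height) of regions to hide -- hypothetical values
sensitive_regions = [(40, 60, 200, 200), (900, 30, 300, 120)]

for (x, y, w, h) in sensitive_regions:
    roi = image[y:y + h, x:x + w]
    # Heavy Gaussian blur makes text and codes unreadable; use a filled
    # rectangle instead if complete removal is required.
    image[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)

cv2.imwrite("site_photo_masked.jpg", image)  # upload this file, not the original
```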
5. Concluding Remarks
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- You, Z. Intelligent construction: Unlocking opportunities for the digital transformation of China’s construction industry. Eng. Constr. Archit. Manag. 2024, 31, 1429–1453. [Google Scholar] [CrossRef]
- Baduge, S.K.; Thilakarathna, S.; Perera, J.S.; Arashpour, M.; Sharafi, P.; Teodosio, B.; Shringi, A.; Mendis, P. Artificial intelligence and smart vision for building and construction 4.0: Machine and deep learning methods and applications. Autom. Constr. 2022, 141, 104440. [Google Scholar] [CrossRef]
- Pan, Y.; Zhang, L. Roles of artificial intelligence in construction engineering and management: A critical review and future trends. Autom. Constr. 2021, 122, 103517. [Google Scholar] [CrossRef]
- Liu, H.; Wang, D.; Xu, K.; Zhou, P.; Zhou, D. Lightweight convolutional neural network for counting densely piled steel bars. Autom. Constr. 2023, 146, 104692. [Google Scholar] [CrossRef]
- Shin, Y.; Heo, S.; Han, S.; Kim, J.; Na, S. An image-based steel rebar size estimation and counting method using a convolutional neural network combined with homography. Buildings 2021, 11, 463. [Google Scholar] [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Floridi, L.; Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 2020, 30, 681–694. [Google Scholar] [CrossRef]
- Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Raj, G.D.; Jhaveri, R.H.; Prabadevi, B.; Wang, W.; et al. GPT (Generative Pre-Trained Transformer)—A comprehensive review on enabling technologies, potential applications, emerging challenges, and future directions. IEEE Access 2024, 12, 54608–54649. [Google Scholar] [CrossRef]
- Wu, T.; He, S.; Liu, J.; Sun, S.; Liu, K.; Han, Q.-L.; Tang, Y. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA J. Autom. Sin. 2023, 10, 1122–1136. [Google Scholar] [CrossRef]
- Xu, P.; Zhu, X.; Clifton, D.A. Multimodal learning with transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12113–12132. [Google Scholar] [CrossRef] [PubMed]
- Chen, Z.; Wang, W.; Tian, H.; Ye, S.; Gao, Z.; Cui, E.; Tong, W.; Hu, K.; Luo, J.; Ma, Z.; et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Sci. China Inf. Sci. 2024, 67, 220101. [Google Scholar] [CrossRef]
- Guo, Z.; Xu, R.; Yao, Y.; Cui, J.; Ni, Z.; Ge, C.; Chua, T.-S.; Liu, Z.; Huang, G. LLaVA-UHD: An LMM Perceiving Any Aspect Ratio and High-Resolution Images. In Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15141, pp. 390–406. [Google Scholar] [CrossRef]
- OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
- Bai, J.; Bai, S.; Yang, S.; Wang, S.; Tan, S.; Wang, P.; Lin, J.; Zhou, C.; Zhou, J. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv 2023, arXiv:2308.12966. [Google Scholar]
- Shan, B.; Yin, W.; Sun, Y.; Tian, H.; Wu, H.; Wang, H. ERNIE-ViL 2.0: Multi-view contrastive learning for image-text pre-training. arXiv 2022, arXiv:2209.15270. [Google Scholar]
- Wang, W.; Lv, Q.; Yu, W.; Hong, W.; Qi, J.; Wang, Y.; Ji, J.; Yang, Z.; Zhao, L.; Xuan, S. CogVLM: Visual expert for pretrained language models. Adv. Neural Inf. Process. Syst. 2024, 37, 121475–121499. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Li, Z.; Yang, B.; Liu, Q.; Ma, Z.; Zhang, S.; Yang, J.; Sun, Y.; Liu, Y.; Bai, X. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv 2023, arXiv:2311.06607. [Google Scholar]
- Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K.V.; Joulin, A.; Misra, I. Imagebind: One embedding space to bind them all. arXiv 2023, arXiv:2305.05665. [Google Scholar] [CrossRef]
- Gemini Team Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar] [CrossRef]
- 01.AI. Yi: Open foundation models by 01.AI. arXiv 2024, arXiv:2403.04652. [Google Scholar]
- Hu, J.; Yao, Y.; Wang, C.; Wang, S.; Pan, Y.; Chen, Q.; Yu, T.; Wu, H.; Zhao, Y.; Zhang, H.; et al. Large multilingual models pivot zero-shot multimodal learning across languages. arXiv 2023, arXiv:2308.12038. [Google Scholar]
- XVERSE-V-13B. Available online: https://github.com/xverse-ai/XVERSE-V-13B (accessed on 3 July 2025).
- Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv 2023, arXiv:2301.12597. [Google Scholar]
- Huo, Y.; Zhang, M.; Liu, G.; Lu, H.; Gao, Y.; Yang, G.; Wen, J.; Zhang, H.; Xu, B.; Zheng, W.; et al. WenLan: Bridging vision and language by large-scale multi-modal pre-training. arXiv 2021, arXiv:2103.06561. [Google Scholar]
- Hunyuan AI. Available online: https://hunyuan.tencent.com (accessed on 3 July 2025).
- Sun, Q.; Wang, J.; Yu, Q.; Cui, Y.; Zhang, F.; Zhang, X.; Wang, X. EVA-CLIP-18B: Scaling CLIP to 18 billion parameters. arXiv 2024, arXiv:2402.04252. [Google Scholar]
- Ye, Q.; Xu, H.; Ye, J.; Yan, M.; Hu, A.; Liu, H.; Qian, Q.; Zhang, J.; Huang, F.; Zhou, J. mPLUG-Owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv 2023, arXiv:2311.04257. [Google Scholar]
- Xinghuo AI. Available online: https://xinghuo.xfyun.cn (accessed on 3 July 2025).
- The Claude 3 Model Family: Opus, Sonnet, Haiku. Available online: https://www.anthropic.com/news/claude-3-haiku?ref=ai-recon.ghost.io (accessed on 3 July 2025).
- Wang, J.; Liu, Z.; Zhao, L.; Wu, Z.; Ma, C.; Yu, S.; Dai, H.; Yang, Q.; Liu, Y.; Zhang, S.; et al. Review of large vision models and visual prompt engineering. Meta-Radiology 2023, 1, 100047. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything (SA) project: A new task, model, and dataset for image segmentation. arXiv 2023, arXiv:2304.02643. [Google Scholar]
- Ke, L.; Ye, M.; Danelljan, M.; Liu, Y.; Tai, Y.-W.; Tang, C.-K.; Yu, F. Segment anything in high quality. arXiv 2023, arXiv:2306.01567. [Google Scholar] [CrossRef]
- Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast segment anything. arXiv 2023, arXiv:2306.12156. [Google Scholar] [CrossRef]
- Wang, X.; Zhang, X.; Cao, Y.; Wang, W.; Shen, C.; Huang, T. SegGPT: Segmenting everything in context. arXiv 2023, arXiv:2304.03284. [Google Scholar] [CrossRef]
- Zou, X.; Yang, J.; Zhang, H.; Li, F.; Li, L.; Wang, J.; Wang, L.; Gao, J.; Lee, Y.J. Segment everything everywhere all at once. arXiv 2023, arXiv:2304.06718. [Google Scholar] [CrossRef]
- Google Cloud Vertex AI. Available online: https://cloud.google.com/vertex-ai (accessed on 3 July 2025).
- Baidu AI EasyDL. Available online: https://ai.baidu.com/easydl (accessed on 3 July 2025).
- Hugging Face Open LLM Leaderboard. Available online: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard (accessed on 3 July 2025).
- Shanghai AI Laboratory OpenCompass Leaderboard. Available online: https://rank.opencompass.org.cn/home (accessed on 3 July 2025).
- Zhang, D.; Yu, Y.; Dong, J.; Li, C.; Su, D.; Chu, C.; Yu, D. MM-LLMs: Recent advances in multimodal large language models. arXiv 2024, arXiv:2401.13601. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar]
- Wang, Z.; Sun, Q.; Zhang, B.; Wang, P.; Zhang, J.; Zhang, Q. PM2: A new prompting multi-modal model paradigm for few-shot medical image classification. arXiv 2024, arXiv:2404.08915. [Google Scholar]
- Automatically Generate First Draft Prompt Templates Anthropic. Available online: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prompt-generator (accessed on 3 July 2025).
- Li, Y.; Lu, Y.; Chen, J. A deep learning approach for real-time rebar counting on the construction site based on YOLOv3 detector. Autom. Constr. 2021, 124, 103602. [Google Scholar] [CrossRef]
- Chen, J.; Chen, W.; Li, Y. Intelligent real-time counting of construction materials based on object detection. J. Tongji Univ. (Nat. Sci. Ed.) 2023, 51, 1701–1710. [Google Scholar] [CrossRef]
- Chen, J.; Huang, Q.; Chen, W.; Li, Y.; Chen, Y. Automated counting of steel construction materials: Model, methodology, and online deployment. Buildings 2024, 14, 1661. [Google Scholar] [CrossRef]
Model Name | Developer | Release Date | Training Dataset | Number of Parameters (Billion) | Main Architecture | Implementation Strategy | Open Source Status | Application Fields | URL (Uniform Resource Locator) |
---|---|---|---|---|---|---|---|---|---|
InternVL-Chat-V1.5 [12] | Shanghai AI Laboratory | 2024.04 | High-quality bilingual dataset covering common scenes and document images | 25.5 | InternViT-6B-448px-V1-5 + MLP + InternLM2-Chat-20B | Pretraining stage: ViT + MLP; Supervised fine-tuning stage: ViT + MLP + LLM | Open Source | Visual question answering, Character recognition, Real-world understanding | https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5, accessed on 12 August 2025. |
LLaVA-UHD [13] | Tsinghua University | 2024.03 | CC-595K and 656K mixture dataset | N/A | CLIP-ViT-L/14 + Perceiver Resampler + Vicuna-13B | Modularized Visual Encoding + Compression of Visual Tokens + Spatial Schema | Open Source | Visual question answering, Optical character recognition, Vision-language understanding tasks | https://github.com/thunlp/LLaVA-UHD, accessed on 12 August 2025. |
GPT-4 [14] | OpenAI | 2023.03 | Web pages, books, articles, and conversations | Exceeds GPT-3's 175 | Transformer + Mixture of Experts (MoE) + Self-Attention | Large-scale pre-training + Fine-tuning | Closed Source | Natural language processing, Dialogue systems, Content generation | https://chatgpt.com, accessed on 12 August 2025. |
Qwen-VL [15] | Alibaba Cloud | 2023.08 | Multilingual multimodal cleaned corpus | 9.6 | ViT + Qwen-7B + Vision-Language Adapter | Multi-task Pretraining + Supervised Fine-tuning | Open Source | Image captioning, Visual question answering, Grounding | https://github.com/QwenLM/Qwen, accessed on 12 August 2025. |
ERNIE-ViL 2.0 [16] | Baidu | 2022.09 | 29M image-text pairs from public English datasets, 1.5B Chinese image-text pairs | N/A | EfficientNet-L2 + BERT-large + ViT + ERNIE | Multi-View Contrastive Learning Framework | Open Source | Cross-modal retrieval, Visual question answering, Multimodal representation learning | https://github.com/PaddlePaddle/ERNIE, accessed on 12 August 2025. |
CogVLM [17] | Tsinghua University, Zhipu AI | 2024.01 | LAION-2B, COYO-700M, and a visual grounding dataset of 40M images | 17 | ViT + MLP + GPT + Visual Expert Module | Trainable visual expert module to deeply fuse vision and language features | Open Source | Image captioning, Visual question answering, Visual grounding | https://github.com/THUDM/CogVLM, accessed on 12 August 2025. |
CLIP [18] | OpenAI | 2021.01 | 400 million (image, text) pairs collected from the internet | N/A | ViT + ResNet + Transformer | Contrastive Learning + Pre-training | Open Source | Zero-shot transfer to various computer vision tasks | https://github.com/OpenAI/CLIP, accessed on 12 August 2025. |
Monkey [19] | Huazhong University of Science and Technology, Kingsoft | 2023.11 | 19 different datasets, including 1.44 million samples | 9.8 | ViT-BigG + Qwen-VL + LoRA | Multi-level description generation method + Sliding window method | Open Source | Image captioning, Visual question answering, Document-oriented visual question answering | https://github.com/Yuliang-Liu/Monkey, accessed on 12 August 2025. |
IMAGEBIND [20] | Meta AI | 2023.05 | Image-text, video-audio, image-depth, image-thermal, video-IMU | N/A | All modality encoders based on Transformer | Leverages image-paired data for joint embedding across six modalities | Open Source | Cross-modal retrieval, Zero-shot recognition, Few-shot recognition | https://facebookresearch.github.io/ImageBind, accessed on 12 August 2025. |
Gemini 1.5 Pro [21] | Google | 2024.02 | Multimodal and multilingual data | N/A | Sparse Mixture-of-Experts (MoE) Transformer | JAX + ML Pathways + TPUv4 Distributed Training | Closed Source | Multilingual translation, Multimodal long-context understanding | https://gemini.google.com/app, accessed on 12 August 2025. |
Yi-VL [22] | 01.AI | 2023.11 | 3.1T tokens of English and Chinese corpora | 6/34 | Transformer + Grouped-Query Attention + SwiGLU + RoPE | Train ViT and projection + Train with higher-resolution images + Jointly train all modules | Open Source | Language modeling, Vision-language tasks, Long-context retrieval, Chat models | https://huggingface.co/01-ai, accessed on 12 August 2025. |
MiniCPM-V-2_5 [23] | OpenBMB | 2024.05 | SigLIP-400M | 8 | Improved version based on Llama3-8B-Instruct | LoRA fine-tuning + GGUF format + quantization + NPU optimization | Open Source | Multilingual support, Mobile deployment, Multimodal tasks | https://github.com/OpenBMB/MiniCPM-V, accessed on 12 August 2025. |
XVERSE-V-13B [24] | Shenzhen Yuanxiang | 2024.04 | 2.1 billion image-text pairs and 8.2 million instruction data points | 13 | Clip-vit-large-patch14-224 + MLP + XVERSE-13B-Chat | Large-scale multimodal pre-training + Fine-tuning | Open Source | Visual question answering, Character recognition, Real-world understanding | https://www.modelscope.cn/models/xverse/XVERSE-V-13B, accessed on 12 August 2025. |
BLIP-2 [25] | Salesforce Research | 2023.01 | Large dataset of 129M images | 0.188 (Q-Former) | Q-Former + Frozen Image Encoder + Frozen LLMs | Vision-language representation learning and generative learning | Open Source | Visual question answering, Image captioning, Image-text retrieval | https://github.com/salesforce/LAVIS/tree/main/projects/blip2, accessed on 12 August 2025. |
BriVL [26] | Renmin University of China, Chinese Academy of Sciences | 2021.07 | 30 million image-text pairs | 1 | Two-tower architecture, including text encoder and image encoder | Contrastive learning enhanced with MoCo for managing large negative sample sets efficiently | Open Source | Image-text retrieval, Captioning, Visual understanding | https://github.com/BAAI-WuDao/BriVL, accessed on 12 August 2025. |
Hunyuan [27] | Tencent | 2023.09 | N/A | N/A | N/A | N/A | Closed Source | Content creation, Logical reasoning, Multimodal interaction | https://hunyuan.tencent.com, accessed on 12 August 2025. |
EVA-CLIP-18B [28] | Beijing Academy of Artificial Intelligence | 2024.02 | Merged-2B, LAION-2B, COYO-700M, LAION-COCO, Merged-video | 18 | Based on CLIP, utilizing both vision and language components | Weak-to-strong vision scaling + RMSNorm and LAMB optimizer | Open Source | Image classification, Video classification, Image-text retrieval | https://github.com/baaivision/EVA/tree/master/EVA-CLIP-18B, accessed on 12 August 2025. |
mPLUG-Owl2 [29] | Alibaba | 2023.11 | 400M image-text pairs | 8.2 | ViT-L + LLaMA-2-7B | Modality-adaptive modular network + Pre-trained with joint tuning | Open Source | Multi-modal tasks, Vision-language tasks, Pure-text tasks | https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2, accessed on 12 August 2025. |
iFlytek Spark V3.5 [30] | iFLYTEK | 2024.01 | N/A | N/A | N/A | N/A | Closed Source | Multilingual capability, Knowledge-based question answering, Text generation | https://xinghuo.xfyun.cn, accessed on 12 August 2025. |
Claude 3 Family [31] | Anthropic | 2024.03 | Proprietary mix of public and non-public data as of Aug 2023 | N/A | N/A | Utilizes various training methods including unsupervised learning and Constitutional AI | Closed Source | Reasoning, Coding, Multi-lingual Understanding | https://docs.anthropic.com, accessed on 12 August 2025. |
Model Name | Developer | Release Date | Number of Parameters (Million) | URL (Uniform Resource Locator) |
---|---|---|---|---|
SAM [33] | Meta AI | 2023.04 | 68 | https://github.com/facebookresearch/segment-anything, accessed on 12 August 2025. |
HQ-SAM [34] | ETH Zürich, HKUST | 2023.10 | Slight increase over SAM | https://github.com/SysCV/SAM-HQ, accessed on 12 August 2025. |
FastSAM [35] | Chinese Academy of Sciences | 2023.06 | 68 | https://github.com/CASIA-IVA-Lab/FastSAM, accessed on 12 August 2025. |
SegGPT [36] | Beijing Academy of Artificial Intelligence | 2023.04 | 150 | https://github.com/baaivision/Painter, accessed on 12 August 2025. |
SEEM [37] | Microsoft Research | 2023.12 | N/A | https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once, accessed on 12 August 2025. |
[Table: counting test results of the multimodal large models (GPT-4, ERNIE Bot, Qwen, GLM-4, mPLUG, Spark Desk, Gemini, Hunyuan, Claude3, MiniCPM) on rebars (real number: 85), square steel pipes (60), I-beams (60), and wooden beams (60); the result cells in the source are annotated images and are not reproduced here.]
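For reference, a test like the ones summarized above can also be scripted rather than run through a web interface. The sketch below issues one counting query via the OpenAI Python SDK; the model name, prompt wording, file name, and choice of SDK are assumptions, since the paper evaluated each model through its own platform.

```python
# Illustrative only: scripting one counting query against a multimodal model
# through the OpenAI Python SDK. Model name, prompt wording, and file name
# are assumptions for this sketch.
import base64
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("rebar_bundle.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Count the rebar cross-sections visible in this image. "
                         "Reply with a single integer."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
predicted_count = response.choices[0].message.content.strip()
print(predicted_count)  # compare against the manually verified ground truth (e.g., 85)
```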
[Table: counting test results of the purely visual large models under three interaction strategies (interactive point selection, interactive box selection, automatic segmentation) for rebars, circular steel tubes, square steel pipes, I-beams, wooden beams, and wheel fasteners; the result cells in the source are annotated images and are not reproduced here.]
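The three interaction strategies above map directly onto the programmatic interface of the segment-anything package cited in the references. The sketch below shows one way each strategy could be driven for counting; the checkpoint file, prompt coordinates, file name, and area filter are illustrative assumptions.

```python
# Illustrative only: the three SAM interaction strategies listed above, driven
# through the segment-anything package. Checkpoint, coordinates, and the area
# filter are assumptions.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

image = cv2.cvtColor(cv2.imread("steel_pipes.jpg"), cv2.COLOR_BGR2RGB)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")

# 1) Interactive point selection: one foreground click on a pipe end.
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),   # hypothetical click on one pipe
    point_labels=np.array([1]),            # 1 = foreground
)

# 2) Interactive box selection: a rough box around one pipe end.
box_masks, _, _ = predictor.predict(box=np.array([280, 200, 360, 280]))

# 3) Automatic segmentation: segment everything, then count masks whose
#    area falls in the expected range for a single pipe cross-section.
generator = SamAutomaticMaskGenerator(sam)
all_masks = generator.generate(image)
count = sum(1 for m in all_masks if 200 < m["area"] < 5000)
print(f"Estimated pipe count: {count}")
```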
Sample Image | EasyDL Model | Li’s Model |
---|---|---|
(Real number: 164) | (False: 4, Missed: 10) | (False: 0, Missed: 0) |
(Real number: 63) | (False: 3, Missed: 0) | (False: 0, Missed: 0) |
(Real number: 189) | (False: 5, Missed: 2) | (False: 0, Missed: 0) |
(Real number: 243) | (False: 1, Missed: 13) | (False: 0, Missed: 0) |
No. | Affected Metric | Impact Level | Root Cause Analysis | Optimization Strategy |
---|---|---|---|---|
1 | Accuracy | High | Color bias has a significant impact on accuracy, with a variance of 0.0127 across different feature ranges. | Configure “Color, Posterize” in [Add Data] → [Data Augmentation Strategy] for enhancement. |
2 | Miss rate | High | Color bias has a significant impact on the miss rate, with a variance of 0.0127 across different feature ranges. | Configure “Color, Posterize” in [Add Data] → [Data Augmentation Strategy] for enhancement. |
3 | Accuracy | High | Saturation has a significant impact on accuracy, with a variance of 0.0123 across different feature ranges. | Configure “Color” in [Add Data] → [Data Augmentation Strategy] for enhancement. |
4 | Miss rate | High | Saturation has a significant impact on the miss rate, with a variance of 0.0123 across different feature ranges. | Configure “Color” in [Add Data] → [Data Augmentation Strategy] for enhancement. |
5 | Accuracy | High | Target box size has a significant impact on accuracy, with a variance of 0.0116 across different feature ranges. | Try higher-precision models, or try small object detection or more optimization strategies in Baidu Machine Learning (BML). |
Sample Image | EasyDL Model | Others’ Models |
---|---|---|
(Real number: 98) | (False: 1, Missed: 12) (AP50: 86.85%) | (False: 0, Missed: 0) (AP50: 93.01%) |
(Real number: 130) | (False: 4, Missed: 8) (AP50: 90.54%) | (False: 0, Missed: 0) (AP50: 91.41%) |
(Real number: 55) | (False: 4, Missed: 0) (AP50: 97.68%) | (False: 0, Missed: 0) (AP50: 99.48%) |
(Real number: 99) | (False: 8, Missed: 2) (AP50: 96.63%) | (False: 0, Missed: 0) (AP50: 97.06%) |
(Real number: 36) | (False: 1, Missed: 1) (AP50: 98.81%) | (False: 0, Missed: 0) (AP50: 99.40%) |