1. Introduction
Highway bridge construction is an important element of road transport and plays an increasingly significant role in the development of the transportation sector. During actual bridge construction, complex on-site operating environments, large numbers of construction personnel, and the irregular operation of equipment often lead to major safety accidents [1], resulting in significant loss of life and economic damage to society and families [2]. Therefore, identifying workers, equipment, and the behavioral relationships between workers and equipment at bridge construction sites, and thereby inferring the current construction scene, has important application value for construction safety prevention.
The earliest methods for construction safety monitoring relied primarily on manual monitoring during construction and safety assessment after completion [3]. However, owing to the wide working area, the large number of people on the construction site, and the complexity of the equipment used, relying solely on manual point-by-point monitoring is time-consuming and labor-intensive, and the monitoring results are prone to error.
Most current researchers apply deep learning methods from artificial intelligence (AI) to safety monitoring during construction [4], focusing on target detection of construction workers wearing helmets and holding equipment [5]. However, such methods ignore the interrelationships between workers and construction objects, leaving safety monitoring without early-warning capability when workers perform non-compliant construction operations.
In recent years, more researchers have focused on visual relationship detection in deep learning, which aims to determine the topological relationships between targets in a scene [6,7] and to generate triplets of the form subject–predicate–object. This approach can more accurately represent and describe construction scene information and contextual relationships. VRD [9] used R-CNN [8] to obtain target candidate boxes and predicted relationships by combining a visual model and a semantic model to score the likelihood of each triplet. Different from VRD, VTransE [10] is an end-to-end model that maps the visual features of targets into a low-dimensional relational space, using translation vectors to represent the relationships between targets. CAI [11] used the textual representation of the subject and object as contextual information to build a visual relationship detection model. Because features are the basis of target identification, the DR-Net model incorporates more features, estimating the occurrence probabilities of subjects, predicates, and objects from visual features, spatial structure features, and relational features [12]. To better understand the relationships between targets, ViP-CNN [13] establishes associations among subjects, predicates, and objects over visual features by passing information between different branches at the same layer. Zoom-Net [14] performs deep information transfer between local target features and global predicate relation features to achieve deep integration of subjects and predicates. Visual relationship detection has now been applied to a variety of image understanding tasks, including image understanding in construction scenes. Wu et al. [5] performed relationship detection between workers and equipment by estimating the head pose and body orientation of the worker. Kim et al. [15] reconstructed individual behaviors from the types of interactions between workers and equipment to improve construction scene identification. Xiong et al. [16] applied visual relationship detection in construction to a video surveillance system, further improving the timeliness of construction safety warnings. These methods can identify specific targets and the interrelationships between targets in construction scenes, but they do not go further to realize scene identification and understanding on this basis, and thus cannot achieve automated, intelligent safety monitoring during construction. In addition, owing to the relatively high complexity of construction scenes, missed and incorrect target detections are common.
Visual relationship detection presents the full information in an image and solves the problem of fragmented object relationships that arises when target detection algorithms are used alone. However, visual relationship detection has seen only a few applications in highway bridge construction. To achieve intelligent safety monitoring of the bridge construction process and to accomplish construction scene identification and understanding, this paper proposes a visual relationship-based method for identifying construction scenes on highway bridges. The method combines the construction characteristics of highway bridges with the ideas of deep learning. Scene identification rules are formulated according to the target features and interrelationships in the construction scenes, and a scene identification model is then built on these rules to produce a textual output of key scene information. The main work of this paper is as follows:
(1) Selection of key construction scenes on bridges. There are numerous bridge construction processes. Therefore, in this study, five key construction scenes of a bridge were selected based on an analysis of its construction characteristics and construction process.
(2) Formulation of identification rules for key construction scenes on bridges. A feature is the basis of scene identification. This study examines the underlying features that can distinguish the categories of key construction scenes, and establishes a feature information table and a tree diagram for the identification of key construction scenes on highway bridges. On this basis, the identification rules under different construction scenes are formulated.
(3) Building an identification model for key construction scenes on bridges. In the target detection module, a feature pyramid network (FPN) and color moments are introduced to perform multiscale target detection and obtain the identity information of construction personnel, while reducing the rate of missed and incorrect detections. In the visual relationship extraction module, feature vectors are introduced to connect subjects, objects, and predicates in construction scenes in order to determine the interaction relationships between targets. In the semantic conversion module, frequency baselines are introduced to count the predicates in the construction scene and obtain the probability distribution of construction personnel actions. In the scene information fusion module, an image–text encoder is introduced to combine the image features with the detection results and obtain the correspondence between images and text. In the scene identification output module, a rule consistency matching strategy is introduced to match the detected features against the formulated rules, yielding the category information of key construction scenes on highway bridges.
(4) Validation of the scene identification method. Experimental validation was performed using a self-built dataset for key construction scene identification on a highway bridge. Accuracy, precision, recall, and other evaluation metrics were used to assess the accuracy of the proposed scene identification method. Moreover, we performed a comparative analysis with other visual relationship detection methods to demonstrate the effectiveness of the proposed method.
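To make the color moments mentioned in (3) concrete, the sketch below computes the standard first three color moments (per-channel mean, standard deviation, and skewness) over a detected image region; the paper does not specify its exact formulation, so this is only the common textbook definition, not the authors' implementation.

```python
import math

def color_moments(pixels):
    """First three color moments (mean, std, skewness) for each channel.

    `pixels` is a list of (R, G, B) tuples sampled from a detected region,
    e.g. a worker's bounding box; the nine values form a compact color
    feature vector that can help distinguish personnel by clothing color.
    """
    n = len(pixels)
    features = []
    for c in range(3):
        channel = [p[c] for p in pixels]
        mean = sum(channel) / n
        var = sum((v - mean) ** 2 for v in channel) / n
        std = math.sqrt(var)
        # Skewness here: sign-preserving cube root of the third central moment.
        m3 = sum((v - mean) ** 3 for v in channel) / n
        skew = math.copysign(abs(m3) ** (1.0 / 3.0), m3)
        features.extend([mean, std, skew])
    return features
```

A uniform-color region yields zero standard deviation and zero skewness in every channel, so the vector degenerates to the three channel means, which matches the intuition that the higher moments capture color variation within the region.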
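The rule consistency matching strategy of (2) and (3) can be sketched as checking whether every subject–predicate–object triplet required by a scene's rule appears among the detected relationships. The scene names, rules, and triplets below are hypothetical illustrations, not the paper's actual rule table.

```python
# Hypothetical rules: each scene requires a set of subject-predicate-object
# triplets to all be present among the detected visual relationships.
SCENE_RULES = {
    "rebar_tying": {("worker", "hold", "rebar"),
                    ("worker", "stand_on", "formwork")},
    "concrete_pouring": {("worker", "operate", "pump_truck"),
                         ("pump_truck", "pour", "concrete")},
}

def identify_scene(detected_triplets):
    """Return the first scene whose required triplets are all detected."""
    detected = set(detected_triplets)
    for scene, required in SCENE_RULES.items():
        if required <= detected:  # consistency match: rule is a subset of detections
            return scene
    return None  # no rule fully satisfied

triplets = [("worker", "hold", "rebar"),
            ("worker", "stand_on", "formwork"),
            ("worker", "wear", "helmet")]
print(identify_scene(triplets))  # -> rebar_tying
```

Representing rules as triplet sets makes the matching a simple subset test, and extra detected relationships (such as the helmet triplet above) do not interfere with a match.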
5. Conclusions
The construction process of highway bridges is tedious and site environments are complex; realizing bridge construction scene identification therefore helps the relevant departments carry out safety control. Based on the idea of visual relationships, this paper proposes a method for identifying key construction scenes on highway bridges. The method enables automated, intelligent monitoring during the construction process and extends the applications of visual relationship detection in bridge construction. First, the characteristics of bridge construction are analyzed and five key construction scenes are selected as research objects. Then, scene identification rules are formulated from three aspects: construction personnel, construction equipment, and construction materials. Next, the CSIN model is built: FPN and color moments are first introduced to obtain the image features of construction workers and to mitigate missed and incorrect target detections; then, through the division of subject–predicate–object triplets and image–text encoding, the semantic and visual features of the construction scene are obtained; finally, the fused features are matched against the scene identification rules for consistency, yielding the category information of the construction scene. Finally, the method is validated experimentally; the results show that, compared with other algorithms, the CSIN model obtains better results, especially on Recall@100.
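Recall@K, the metric on which the CSIN results above are reported, measures the fraction of ground-truth triplets recovered among the top-K most confident predictions. The sketch below uses hypothetical triplets; the actual evaluation protocol (e.g. how boxes are matched) is not reproduced here.

```python
def recall_at_k(predictions, ground_truth, k):
    """Recall@K for visual relationship detection.

    `predictions`: (subject, predicate, object) triplets sorted by
    descending confidence; `ground_truth`: the annotated triplets.
    Returns the fraction of ground-truth triplets found in the top K.
    """
    top_k = set(predictions[:k])
    hits = sum(1 for t in ground_truth if t in top_k)
    return hits / len(ground_truth)

preds = [("worker", "operate", "crane"),
         ("worker", "wear", "helmet"),
         ("worker", "stand_on", "deck")]
gt = [("worker", "wear", "helmet"), ("worker", "hold", "rebar")]
print(recall_at_k(preds, gt, 100))  # -> 0.5
```

Recall@K is preferred over precision in this task because relationship annotations are incomplete: a predicted triplet absent from the ground truth is not necessarily wrong, so the metric only penalizes missed annotations.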
Although the method in this paper addresses the above problems, two limitations remain. First, the method has been experimentally validated only in five key construction scenes; other bridge construction scenes have not yet been studied. Second, the method covers few types of large equipment and construction materials, lacking, for example, detection of large cranes, pile drivers, concrete, long rebar, and other targets. Therefore, for the construction monitoring of different bridge types, such as girder bridges, arch bridges, rigid-frame bridges, suspension bridges, cable-stayed bridges, and combined-system bridges, it is necessary to further expand the identifiable elements in the construction scenes to enrich the bridge construction scene categories.
In our study, we found that producing the dataset was time-consuming and laborious. In future work, we will adopt efficient approaches such as crowdsourced labeling to produce targeted visual relationship detection datasets and improve work efficiency. In addition, we will further optimize the CSIN model and, in combination with the relevant construction safety standards, realize safety monitoring and safety assessment of bridge construction on the basis of the existing methods. In this way, we aim to form a complete set of methods for intelligent monitoring and safety assessment of bridge construction and to extend them to other construction scenes.