Design and Implementation of Industrial Accident Detection Model Based on YOLOv4



Introduction
Manufacturing represents a massive share of the global economy. In most countries, workers engaged in manufacturing must comply with occupational safety rules, and each government enforces such laws institutionally. Violators of these regulations are punished and receive significant social attention. However, even with strong institutional coercion and social interest, injuries and deaths at industrial sites still occur owing to, for example, the environment of each workplace, various industry-specific factors, and insensitivity to safety concerns. Thus, accidents continue to occur despite being addressed through workplace environment management, policies, and the encouragement of safety management education [1,2].
The issue of industrial accidents is consistently raised in Korea. According to the data on industrial accidents and the yearbooks published quarterly by the Korea Occupational Safety and Health Agency, from 2012 to 2021, as the number of workplaces and workers continued to increase, the number of injured people continued to rise [3]. This can be regarded as a side effect of Korea's rapid growth [4,5]. According to the data on accident rates per 100,000 workers published by the International Labor Organization, Korea's accident rate ranks the highest among Organization for Economic Co-operation and Development (OECD) countries, followed by Mexico, Turkey, the United States, and Lithuania [6]. These data are shown in Figure 1 below. In Figure 1, the countries are shown in order of descending GDP based on World Bank 2022 data [7]. The names of the countries are on the x-axis, and the values in parentheses indicate the latest year in which the International Labor Organization accident rates were reported: 2021 for Mexico, Turkey, Lithuania, and South Korea, and 2018 for the United States. Korea has the third-highest accident rate among the top 10 OECD countries. Figure 2 shows the number of injuries and deaths by industry from the 2022 data published by the Korea Occupational Safety and Health Agency [8]. In Figure 2, the graph on the left shows the number of injured people by industry, and the chart on the right shows the number of deaths by industry sector. In the industry categories, "Other" generally refers to wholesale and retail, health and social welfare, and food and lodging businesses as service industries; "Etc" refers to fishery, agriculture, finance, and insurance businesses. Both graphs show that the numbers of injuries and fatalities are higher in the manufacturing and construction industries than in other sectors. Figure 3 depicts the number of injuries and deaths according to workplace size. Similar to Figure 2, Figure 3 is divided into left and right graphs, and the data are classified according to workplace size (based on the number of workers). Workplaces with fewer than 50 workers have an overwhelmingly high number of accidents and deaths. Businesses with fewer than 50 workers tend to have poorer environments than other workplaces, as well as a lack of safety awareness among workers, educational activities, and institutional support [9][10][11].
As shown earlier, Korea has a higher industrial accident rate than other countries, and many of these accidents and deaths occur in the manufacturing and construction industries and in workplaces with fewer than 50 workers. Various studies and institutional supplements attempt to solve these problems [12][13][14][15][16][17][18]. In Korea, the "Serious Accident Punishment Act" was implemented in 2022, making employers liable for industrial accidents [19][20][21][22].
This study analyzed common scenarios and disasters occurring in manufacturing workplaces. We standardized the data based on closed-circuit television (CCTV) video to build an object-based detection model. Three workplaces in downtown Daejeon were selected as manufacturing workplaces. The data standardization for the video data was based on the guidelines for building a learning dataset provided by the Korea National Information Society Agency. You Only Look Once (YOLO)v4 was used as the object detection model, and the accuracy exceeded 95%. Thus, this approach could create safe working environments for workers, including those outside the manufacturing sector.

Related Works
This section describes the data labeling tool and object detection model used in this study.

Data Labeling Tool
In general, worker activity and accidents occurring at industrial sites are recorded and monitored by the CCTVs installed there. Because this study uses data filmed by workplace CCTV, it is necessary to identify which event has occurred and which objects appear in each frame of the video. This task is called data labeling or annotation. Various labeling tools exist; Table 1 compares the functions of three such tools [23][24][25]. As CVAT is web-based, labeling can be performed cooperatively. In addition, to consider the user's perspective, the UI and shortcut functions, labeling functions for automatically extracting frames or tracking objects, and the degree of installation difficulty were investigated and displayed.

Object Detection Model
Various open object detection models have been proposed. Starting with the region-based convolutional neural network (R-CNN) in 2014, Fast R-CNN, Faster R-CNN, Mask R-CNN, YOLO, and the single-shot detector (SSD) have subsequently been released. These models can be divided into single-stage and two-stage models [26][27][28][29][30][31][32][33][34][35][36][37]. In the two-stage method, a separate module creates proposals for regions of interest; the model first finds an object in an image and then classifies it. The R-CNN family belongs to this category. In contrast, the single-stage method proposes regions of interest and classifies objects simultaneously in a single shot. The YOLO and SSD models belong to this group. Table 2 compares the models' image input size, accuracy, and frames per second (FPS). The accuracy values shown in Table 2 are based on an indicator called the average precision (AP) used in object detection. The AP uses the intersection over union (IoU), i.e., the area of the intersection divided by the area of the union of the ground-truth object and the object predicted by the model. The larger the IoU value, the more accurately the model has localized the object, because the overlap between the ground-truth box and the predicted box is greater. The IoU value can be adjusted to calculate the precision and recall. The formulas for precision and recall are shown in Equation (1) below:

Precision = TP / (TP + FP), Recall = TP / (TP + FN), (1)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. The AP is obtained by calculating the precision and recall for each class using the above formulas and taking the average value over all classes. The IoU value used for the AP is marked as a subscript; the AP in Table 2 uses an IoU threshold of 0.5. Overall, the two-stage models have a high AP but low FPS, whereas the single-stage models show the opposite tendency. YOLOv4 was ultimately selected for this study as a model balancing a high processing rate (FPS) and high accuracy (AP).
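As an illustration of the IoU described above, the overlap between two axis-aligned bounding boxes can be computed as in the following sketch; the `(x1, y1, x2, y2)` corner convention and the coordinate values are assumptions for illustration only:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, boxes (0, 0, 2, 2) and (1, 1, 3, 3) share an intersection of area 1 over a union of area 7, giving an IoU of 1/7.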
YOLOv4 is the fourth version of the YOLO object detection model. It uses a backbone network combining a cross-stage partial (CSP) technique with the Darknet53 network used in previous versions. The CSP technique reduces the number of computations by preventing the reuse of gradient values [38]. In addition, a spatial pyramid pooling (SPP) technique and the PANet technique are combined. The SPP technique performs more accurate feature extraction by extracting features from an image as a spatial pyramid [39]. The PANet technique up-samples feature extraction filters or images, down-samples them, and then aggregates them [40]. This makes it possible to extract more diverse features. YOLOv4 employs these techniques to overcome the difficulty in detecting small objects experienced by previous YOLO versions. To compare YOLOv4 to other models, we experimented with the performance of SSD+MobileNetV2 and CenterNet [33,34,[41][42][43]. The two compared models are one-stage methods similar to YOLOv4. The one-stage approach simplifies the structure of the model to allow for fast object tracking. SSD here is a network using MobileNetV2 as the backbone network, and CenterNet is a model using key-point estimation. The difference between these models is that YOLOv4 and SSD track objects based on boxes, while CenterNet uses center-point key points.
We compared the mean average precision (mAP) values of the three models and selected the model with the highest value. The mAP is calculated by plotting the precision and recall of Equation (1) against each other as a precision-recall curve and averaging the area under the curve over the number of object classes to be detected. When the same indoor image was tested under the same conditions, the results were as shown in Table 3 below, and YOLOv4 was ultimately selected [43].

Design and Implementation
To implement the industrial accident risk detection model, hazardous situations are derived from industrial sites to select the objects to be detected. The training dataset for learning the risk detection model is standardized in the JavaScript Object Notation (JSON) data format, and the final model is built by conducting experiments to detect objects. The model implementation process is shown in Figure 5.

Scenario Derivation and Object Selection
A risk scenario must be derived before identifying a hazardous situation. Before that, a workplace must be selected; then, the scenario derivation, object selection, and learning data collection can proceed. In this study, the business sites comprise small- and medium-sized businesses in the manufacturing field in Daejeon, Korea. The Korean government has certified these businesses as manufacturers related to industrial safety and worker welfare. Several industrial safety certifications are provided by the Korean government; among them, this study considers manufacturers certified as standard workplaces for persons with disabilities and as having clean workplace systems.
The standard workplace for disabled people is a system supported by the Korea Employment Agency for Persons with Disabilities. This system creates an appropriate physical and emotional environment by presenting environmental standards for disabled people. In this system, disabled people must be hired and selected based on review criteria (such as those concerning welfare) and on-site confirmation surveys [44]. Similarly, the Korea Occupational Safety and Health Agency provides specific amounts of money through the clean workplace system to support the continuous improvement of various factors for creating safe workplaces (e.g., industrial accident occurrence factors, environmental factors, and process procedures) [45].
The three selected manufacturers all hold industrial safety certifications from the Korean government. In general, it is necessary to establish risk prevention measures to prevent and manage industrial safety accidents. In this study, each manufacturer does not represent only one manufacturing field; instead, they represent different types of industries and work sites. Hazardous situations commonly encountered in industrial safety can thus be derived as scenarios. Manufacturer A is a mask manufacturer, B is a precision parts manufacturer, and C is a steel structure manufacturer. Table 4 shows each manufacturer's safety management characteristics and the results of the investigation regarding the objects related to them.

All three manufacturers handle cutting tools and pose risks of operator collisions with carts, trucks, forklifts, and oversized materials. Workers at all three manufacturers must wear safety equipment such as hard hats and gloves. In the cases of mask manufacturer A and precision parts manufacturer B, workers must also use hygiene equipment such as hygiene caps and gloves.
Accordingly, the area of the workplace where cutting tools are handled is classified as hazardous, and the area where masks or precision parts are assembled is classified as a hygiene area. A condition common to all three manufacturers is the presence of two or more workers. As all three manufacturers carry a risk of workers colliding with moving equipment such as carts, trucks, and forklifts both indoors and outdoors, this scenario is extracted as a risk scenario. Another extracted risk scenario is the lack of personal protective equipment.
In addition, the Korea National Information Society Agency recommends guidelines to determine whether a dataset is suitable as a learning dataset; moreover, dataset quality is being actively studied in the academic world [46][47][48][49]. Table 5 shows the results applied in this study to improve the dataset's quality, based on the Korean government's guidelines for dataset suitability and previous references. For each situation, CCTV footage is recorded separately from elevated and frontal viewpoints. Table 5 shows that three manufacturers covering various industries were selected to ensure diversity, reliability, and consistency. Figure 6 shows photos collected from these manufacturers. In Figure 6, (a) is an image taken at manufacturer A, assuming a situation in which materials and a worker collide; (b) is a photo taken at manufacturer B, assuming a situation where a forklift and a worker collide. The accidents were safely reproduced and filmed under the supervision of a safety manager. The filming was conducted using the CCTV installed at each manufacturer's location. We filmed the accidents because finding accident situations in existing working videos is relatively difficult. (c) is an image taken at manufacturer C; it represents a hazardous situation in which materials are loaded into an aisle. (d) to (f) are additional images taken from the front (unlike (a) to (c)) to verify whether personal protective equipment is being worn. Table 6 summarizes the videos used in the experiment according to the scenario. Nine objects are selected according to the scenarios mentioned above. Specifically, workers, materials, carts, forklifts, trucks, hard hats (or hygiene hats), safety clothes (or hygiene clothes), masks, and safety gloves (or hygiene gloves) are selected, as shown in Figure 7.

Training Data Labeling and Standardization
After collecting the learning data according to the scenarios, the objects in the images (such as workers, trucks, forklifts, and personal protective equipment) must first be extracted before the risk judgment model can determine whether a scenario is hazardous. A person must directly label each object in the image for the model to detect the object.
Labeling refers to drawing a bounding box around an object in a frame (or image) and naming the object. CVAT was chosen as the labeling tool by comparing the features of the three tools in Table 1. CVAT is provided as an open-source tool by the OpenCV Foundation [25]. Notably, while labeling, the frames in images with hazardous scenarios are organized separately. Figure 8 summarizes the labeling process and the frames related to the hazardous images. The labeling result is output as frame images of the video in the YOLO model format. The most intuitive way to express the bounding box of an object is the Microsoft "Common Objects in Context" (COCO) format. This method uses the object's top-left point (x_min, y_min), horizontal size w_box, and vertical size h_box. In contrast, the YOLO format is expressed in the form x, y, w, h as the values of the center coordinates and size of the object relative to the resolution (W, H) of the entire image. The formula for converting the COCO format to the YOLO format is shown in Equation (2) [30,50]:

x = (x_min + w_box/2) / W, y = (y_min + h_box/2) / H, w = w_box / W, h = h_box / H. (2)
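As a sketch of Equation (2), the conversion from the COCO corner-plus-size representation to the normalized YOLO center representation can be written as follows; the pixel values in the usage line are illustrative only:

```python
def coco_to_yolo(x_min, y_min, w_box, h_box, img_w, img_h):
    """Convert a COCO-style box (top-left corner + size, in pixels)
    to the YOLO format (center + size, normalized by image resolution)."""
    x = (x_min + w_box / 2) / img_w  # normalized center x
    y = (y_min + h_box / 2) / img_h  # normalized center y
    w = w_box / img_w                # normalized width
    h = h_box / img_h                # normalized height
    return x, y, w, h
```

For instance, a 200 x 100 pixel box with its top-left corner at (100, 50) in a 400 x 200 image maps to the YOLO tuple (0.5, 0.5, 0.5, 0.5).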
The frames, the objects appearing in those frames, and the risk events are integrated and standardized. We add metadata such as the manufacturer at which the learning data were recorded, the frame rate (FPS) and format of the recorded video, the actual time, and the resolution. However, the entire recorded video cannot be used as training data, so the video is divided. The standardized data contain information regarding the events and objects occurring in the frames obtained from the segmented videos.
The learning data are written in JSON format to manage this information efficiently. JSON is a data exchange format whose grammar is organized by the European Computer Manufacturers Association [51]. The reason for standardizing the learning data as JSON is to save the object labeling results for the images in a form similar to that of the PASCAL Visual Object Classes (VOC) and Microsoft COCO image datasets [50,52]. PASCAL VOC uses the Extensible Markup Language (XML) format, and the Microsoft COCO dataset uses JSON. The standardized final result is schematized in tree form, as shown in Figure 9. The circle at the front denotes the root node and contains the video information (including segmented video information). The video is split because the video taken by the CCTV is transferred by day or week. Such a video is too large to label all at once, so it is cut into frame sections. In this study, each section comprises 18,000 frames or less.
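A minimal sketch of one such standardized record is shown below; every field name and value here is an illustrative assumption rather than the study's exact schema:

```python
import json

# Hypothetical record mirroring the tree in Figure 9:
# root -> video info -> segments -> frames -> events and objects.
record = {
    "video": {
        "name": "manufacturerA_cam01.mp4",  # illustrative file name
        "fps": 30,
        "total_frames": 54000,
        "format": "mp4",
        "location": "Manufacturer A",
        "resolution": {"width": 1920, "height": 1080},
    },
    "segments": [
        {
            "segment_no": 1,
            "start_frame": 0,
            "end_frame": 17999,  # sections of 18,000 frames or less
            "frames": [
                {
                    "frame_no": 4321,
                    "events": [{"event_no": 1, "name": "moving-object collision"}],
                    "objects": [
                        {"object_no": 1, "name": "worker"},
                        {"object_no": 4, "name": "forklift"},
                    ],
                }
            ],
        }
    ],
}

# The record round-trips through JSON, as required for file storage.
serialized = json.dumps(record, indent=2)
```

Storing events and objects per frame in this way allows the evaluation step to compare the model's per-frame judgments directly against the human labels.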
The video information includes the video name, length, frame rate (FPS), total number of frames, video format, filming location, start and end times, and the horizontal and vertical dimensions of the video frames.
The segmented video information contains the division number given in the order in which the video was cut, along with the start and end numbers of the frames. As innumerable frames can exist, the video is subdivided once more. As a result, each frame number contains event information for any risk occurring in the corresponding frame and object information for the objects appearing in that frame.
The event information includes the event numbering, name, and unique event number. The event numbering is separated because several events may occur in one frame. The unique event number corresponds to the hazardous scenarios in Table 6. The object information consists of the object numbering, name, and object identification number. The objects are uniquely numbered from 1 to 9 in the order of worker, material, cart, forklift, truck, hard hat (or hygiene hat), safety suit (or hygiene suit), mask, and safety glove (or hygiene glove).
This information is stored to subsequently test the model, as discussed in Section 4. It is used to check whether the model accurately judges a hazardous situation as hazardous. In addition, appropriate data collection is essential for applying the risk assessment model to the other risk scenarios suggested herein or to other manufacturers. Thus, ensuring that the data have been appropriately collected is important.

Model Design
Figure 10 shows the process from data preprocessing to inference after selecting the three manufacturers. After the data preprocessing and YOLOv4 model training described in the previous section, the objects are tracked using the weight files. After the object tracking, four items, a simplified version of the scenarios in Table 6, are inspected. If two or more workers are present, further inspections are conducted for goods loaded in the aisle, whether or not personal protective equipment is worn, and collisions with moving objects.
In the case of a moving object collision, assuming that a worker and a piece of equipment (such as a truck or forklift) are moving in opposite directions, a collision between the two objects is judged to have occurred when the IoU of the two objects exceeds a threshold value. Figure 11 depicts a situation corresponding to such a judgment. Figure 12 shows cases of detecting the presence or absence of loaded goods and the wearing of safety equipment.
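The IoU-based collision judgment can be sketched as follows; the 0.05 threshold matches the value used in the model implementation, while the corner-format boxes and function names are illustrative assumptions:

```python
IOU_THRESHOLD = 0.05  # low threshold: boxes appear small in elevated CCTV views

def boxes_collide(worker_box, vehicle_box, thr=IOU_THRESHOLD):
    """Judge a worker/vehicle collision when the IoU of their
    (x1, y1, x2, y2) bounding boxes meets or exceeds the threshold."""
    ix = max(0.0, min(worker_box[2], vehicle_box[2]) - max(worker_box[0], vehicle_box[0]))
    iy = max(0.0, min(worker_box[3], vehicle_box[3]) - max(worker_box[1], vehicle_box[1]))
    inter = ix * iy
    area_w = (worker_box[2] - worker_box[0]) * (worker_box[3] - worker_box[1])
    area_v = (vehicle_box[2] - vehicle_box[0]) * (vehicle_box[3] - vehicle_box[1])
    union = area_w + area_v - inter
    return union > 0 and inter / union >= thr
```

Even a slight overlap triggers the judgment: two 10 x 10 boxes overlapping in a quarter of their area give an IoU of about 0.14, well above 0.05, whereas a single-pixel corner overlap does not.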

Model Implementation
Table 7 provides the hardware specifications for preprocessing the training data and training the model, along with the installed programs and libraries. YOLOv4 outputs a detection probability for each object; if this probability exceeds a threshold, the corresponding object is considered detected. This study uses a 50% threshold. In addition, the IoU is used as the reference value for determining when a worker collides with a moving object; here, it is set to 0.05. This low value is used because the degree of overlap (IoU) is minimal: objects appear smaller than in images taken from the front because they are captured by CCTV installed at an elevated angle.

Evaluation and Consideration
We measure the response time, delay time, total processing time, object detection precision, and risk determination precision to observe the results from the industrial accident detection model in a real environment and to explore whether this model can be used in actual industrial sites.

Time Measurement
This study evaluates the moving object collision scenario video, as this is the most complex and time-consuming situation to determine. First, we calculate the response time based on this video. Thus, 18,000 frames are read from the test video data, and the time required to start and end the risk determination is calculated as an average value. The average time required for the risk determination including the time required for the object detection process is the delay time. Figure 13 shows the response and delay times required for the risk determination for each frame as graphs. In Figure 13, the response and delay times are displayed in units of 1000 frames for all 18,000 frames. The average response time is 15.94 ms, and the average delay time is 18.43 ms. The average delay time is higher because it incorporates the YOLOv4 object detection time.
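The per-frame timing measurement described above can be sketched as follows; `detect` and `judge_risk` are placeholder callables standing in for the YOLOv4 detector and the rule-based risk judgment, which are not reproduced here:

```python
import time

def measure_per_frame_times(frames, detect, judge_risk):
    """Average per-frame times: the response time covers only the risk
    judgment, while the delay time additionally includes object detection
    (hence the average delay is always at least the average response)."""
    response_times, delay_times = [], []
    for frame in frames:
        t0 = time.perf_counter()
        detections = detect(frame)      # object detection step
        t1 = time.perf_counter()
        judge_risk(detections)          # risk determination step
        t2 = time.perf_counter()
        response_times.append(t2 - t1)  # judgment only
        delay_times.append(t2 - t0)     # detection + judgment
    n = len(response_times)
    return sum(response_times) / n, sum(delay_times) / n
```

Averaging over all 18,000 test frames in this manner would yield the kind of response and delay figures reported in Figure 13.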
We use the same test video to measure the total processing time for the risk determination, i.e., the total time from the start until the information directly labeled by a person and the video frame are displayed on the screen. We then measure the time required to process the JSON files storing the video frames and risk events. Two inputs are used: 1000 JSON files and the 18,000-frame test video. Figure 14 provides a graph of the total processing time. As shown in Figure 14, the total processing time is graphed in units of 100 JSON files and takes an average of 0.23 s. We measure the time taken to read and process the JSON files containing the risk events because the model may be retrained as the industrial site environment changes when used in the field. This checks in advance whether a hazardous situation at the industrial site is correctly determined and whether this model is suitable for actual deployment.

Detection and Judgment Precision
From the object detection and risk judgment accuracies, it is confirmed that the actual model accurately identifies objects and determines hazardous situations. The formula for obtaining the precision is the same as Equation (1). In object detection, a true positive is when the object predicted by the model matches the object information directly labeled by a person in the JSON file; thus, the model is correct. A false positive is when the model predicts an object that does not match any object information entered in the JSON file; thus, the model is incorrect.
In contrast, in risk judgment, a true positive refers to a case where the model classifies an event as hazardous and the event is correctly recorded as hazardous in the JSON file. A false positive is a case where the model classifies an event as hazardous although no corresponding risk event exists in the JSON file. Table 8 lists the results regarding the object detection and risk judgment precision. Looking at Table 8, the object detection is based on YOLOv4, so the precision and recall show good results. However, although the model showed 97.06% precision for the risk situation judgment, it showed a slightly lower recall. The lower recall reflects false negatives, i.e., situations labeled as hazardous in the JSON file that the model failed to judge as hazardous.
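The effect of the two error types on the two metrics can be illustrated with hypothetical counts (not the study's actual tallies):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN).
    False positives lower precision; false negatives lower recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With, say, 97 true positives, 3 false positives, and 5 false negatives, precision is 0.97 while recall drops to roughly 0.951, mirroring the pattern in Table 8 where missed hazardous events pull recall below precision.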

Discussion
In the case of a collision between a worker and a moving object, such as a forklift or truck, the challenge was not identifying the objects themselves. Rather, because the objects appear small, owing to most CCTVs being installed at high angles, identifying collisions by simply calculating the distance between two objects is challenging. We therefore tuned the degree of overlap (IoU) between the objects and determined collisions based on this threshold. Scenarios such as whether safety equipment is worn, whether two or more workers are present, or whether goods are loaded in an aisle are simple problems of determining whether an object is present; the collision scenario differs from these. As such, the evaluation based on the collision experiments is significant.

Conclusions
In this study, three manufacturing businesses were selected, common risk scenarios were extracted, and an industrial accident prevention model was implemented using an object detection model. Among the common risk scenarios, collisions between workers and moving objects and workers failing to wear safety equipment were identified. The model was evaluated in terms of time and accuracy for application in actual environments at industrial sites. The inference time was approximately 18 ms, and the precision of the risk judgment was 97.06%.
This study used data preprocessing to detect objects and identify hazardous situations. The direct object labeling information, risk event information, and metadata of the corresponding video were combined into a single JSON file. Because the model was designed based on rules for risk detection and the algorithm was designed to meet these rules, it showed high accuracy. However, a limit exists insofar as its application to other industrial sites is concerned. In the evaluation results, false negatives slightly lowered the recall of the risk judgment; these are situations recorded as hazardous in the labels that the model did not judge as hazardous. Such errors can ultimately cause problems for workers or employers at actual industrial sites; corresponding studies are left for future research.
In addition, it is believed that deviations will vary greatly depending on the CCTV installation environment. Most CCTVs are installed at high angles, making it difficult to identify objects. They may also be occluded by lighting facilities or medium-to-large equipment installed at industrial sites. Moreover, the data collection process may become biased when the model is directly applied in the workplace. For example, data representing hazardous situations collected from the workplace may be excluded, i.e., administrators operating the CCTVs may delete these data to avoid institutional penalties. In such a case, the model will only train on safe situations, so it may fail to detect unsafe situations. Although this study did not solve all of these limitations, it showed that risk detection based on object detection is possible. Thus, this study can contribute to worker safety via industrial accident prevention solutions and environmental monitoring systems for safety managers.

Figure 1 .
Figure 1. Fatal occupational injuries per 100,000 workers among OECD countries.

Figure 2 .
Figure 2. Number of injured (left) and deaths (right) by industry.

Figure 3 .
Figure 3. Number of injured (left) and number of deaths (right) by business size (number of workers).
Figure 4 is a simple schematic diagram of the YOLOv4 model.

Figure 6 .
Figure 6. Images collected from the three manufacturers.

Figure 7 .
Figure 7. Example of video frames containing nine selected objects.

Figure 8 .
Figure 8. Labeling process using "Computer Vision Annotation Tool" (CVAT) and frames in which hazardous situations were filmed.

Figure 9 .
Figure 9. Building model learning data in JavaScript Object Notation (JSON) format for manufacturer hazardous situation detection.

Figure 10 .
Figure 10. Industrial accident detection model design flow chart.

Figure 11 .
Figure 11. Detection of a hazardous situation when a worker and forklift collide.

Figure 12 .
Figure 12. Loading of goods and inspection of personal protective equipment.

Table 1 .
Feature comparison of three annotations (labeling) tools.

Table 1 indicates the functions provided by each labeling tool. DarkLabel is not open-source, works only on the Windows operating system, and is not web-based. In contrast, YOLO Mark and the "Computer Vision Annotation Tool" (CVAT) are open-source and work well with other operating systems.

Table 2 .
Comparison of the accuracy and frames per second (FPS) of object detection models, including the region-based convolutional neural network (R-CNN) and single-shot detector (SSD).

Table 3 .
Comparative experiment results for object detection model selection.

Table 4 .
Investigation of safety management features and objects for each manufacturer.(•: required)

Table 5 .
Training dataset suitability items from the Korean government and results.

Table 6 .
Information according to the hazardous scenario for the frames used in the experiment.

Table 7 .
Hardware specifications of training data preprocessing system and model training system.

Table 8 .
Results of the object and hazardous situation detection.