Examination of Abnormal Behavior Detection Based on Improved YOLOv3

Examinations are a way to select talent, and a sound invigilation strategy can improve their fairness. To automatically detect abnormal behavior in the examination room, a method based on an improved YOLOv3 (the third version of the You Only Look Once algorithm) is proposed. The YOLOv3 algorithm is improved by using the K-Means algorithm, GIoU loss, focal loss, and Darknet32. In addition, the frame-alternate dual-thread method is used to optimize the detection process. The results show that the improved YOLOv3 algorithm improves both detection accuracy and detection speed, and that the frame-alternate dual-thread method greatly increases detection speed. The mean Average Precision (mAP) of the improved YOLOv3 algorithm on the test set reached 88.53%, and the detection speed reached 42 Frames Per Second (FPS) with the frame-alternate dual-thread detection method. The research results provide a reference for automated invigilation.

For example, an article [7] published in Energies shows how to detect isolator failures in a power grid using a YOLOv3 network. The article shows how to reduce energy expenditure in the operation process using modern methods. This article deals with a similar topic. We also show you how to reduce energy expenditure, but in a completely different application and under completely different conditions. The examination is a knowledge appraisal method generally accepted by the public. Currently, implemented vision methods of checking the integrity of exam-takers still have disadvantages. The invigilator has a limited vision, and the review of surveillance videos in the examination room is time-consuming and laborious. As a result, cheating is widespread, and the fairness of the examination cannot be guaranteed. The survey shows that the proportion of cheating students in colleges and universities is close to 50.0%, and even as high as 80.0% in some research results. In a survey on the question "Have any classmates or friends you know cheated on examinations?", only 2.7% of the students answered "absolutely not." It has become a common phenomenon for students to cheat in various examinations. Strengthening the standardization of invigilation can reduce the possibility of cheating in examinations to a certain extent [8].
The emergence of deep learning has promoted the development of computer vision. Detecting abnormal behavior in examinations is a typical computer vision task that can be achieved by deep learning methods. Object detection algorithms based on deep learning have made great achievements in many fields [9][10][11][12]. YOLOv3 [13][14][15][16] is a typical object detection algorithm based on deep learning. It uses convolutional neural networks (CNN) to complete the detection task and directly returns the position and category of the object. It is known for its high detection accuracy and high detection speed, and it is widely used in vehicle detection, pedestrian detection, ship detection, garbage detection, license plate recognition, etc. [17][18][19][20][21]. In addition, YOLOv3 and its improved variants are often used to detect small objects [22,23]. However, there are few studies on using YOLOv3 to detect abnormal behaviors in examinations. By distinguishing several abnormal behaviors from normal ones in the examination and marking the detected abnormal behaviors, students suspected of cheating can be quickly identified for closer observation. This research can not only save labor costs and improve the efficiency of surveillance video review, but also promote the fairness of the examination and the maintenance of good test order.
In this paper, by selecting the appropriate algorithm and improving this one, both the detection accuracy and detection speed of the abnormal behavior in the examination are improved. At the same time, the alternate-frame dual-thread technology is used to further improve the detection speed of the algorithm to meet the needs of real-time detection.

Materials and Methods
The research work on object detection algorithms has experienced the transition from traditional algorithms to deep learning ones [24]. Traditional object detection methods include the fast multi-feature pedestrian detection algorithm based on the histogram of oriented gradients (HOG) using the discrete wavelet transform, proposed by Gwang-Soo Hong [25], and distinctive image features from scale-invariant keypoints, proposed by Lowe D. [26]. These methods require features to be designed manually for each experimental scenario; the extracted object features are then input into classifiers such as Support Vector Machines (SVM) [27] and Adaboost [28,29] for recognition. The feature extraction process of traditional object detection algorithms is therefore more complicated, their detection accuracy and detection speed are lower than those of object detection algorithms based on deep learning, and the generalization ability of their models is poor [30].
The Region-based Convolutional Neural Network (R-CNN) algorithm proposed by Girshick et al. [31,32] applied deep learning to object detection for the first time [33]. The improved Fast R-CNN algorithm combines the advantages of the Spatial Pyramid Pooling Network (SPPNet) [34], which effectively improves the accuracy of object detection. However, both algorithms use selective search to extract regions, which is computationally expensive, consumes considerable memory, and is slow. The Faster R-CNN algorithm proposed by Ren et al. [35] adds a region proposal network and extracts candidate boxes by setting anchor boxes of different scales, which greatly improves the detection accuracy, but it still does not solve the problem of slow detection speed and cannot meet the needs of real-time detection. Regression-based object detection algorithms are represented by YOLO [36], SSD [37], YOLOv2/YOLO-9000 [38], and YOLOv3. Their detection speed is greatly improved, but at the cost of lower detection accuracy.
Most current research builds on improvements to the above algorithms. A vehicle detection method based on an improved Region-based Fully Convolutional Network (R-FCN) was proposed by Hu Hui [39] et al. The R-FCN, combined with multi-scale training, a deformable network, and soft non-maximum suppression (soft-NMS) [40], improves the detection accuracy, but the generalization ability of the model is poor. Zhao Baokang [41] and others proposed the DFS (Depth-First Search) algorithm for object detection in remote sensing images. They redesigned the dimensional clustering module, the loss function, and the detection mechanism based on sliding window segmentation to improve both the positioning accuracy of objects and the detection accuracy of small objects, but the recall rate decreased significantly. The foreground object detection algorithm based on adaptive threshold adjustment proposed by Li Xingxin [42] and others achieved good accuracy in railway scenes, but the algorithm has large memory consumption. Qiao Ting [43] and others enhanced the training set and designed a two-channel network for feature extraction in Faster R-CNN, which improved the detection accuracy of the algorithm but not its detection speed, so real-time detection cannot be performed.
Electronics 2021, 10, 197
The previously mentioned research results are all targeted algorithm designs, which are carried out in specific scenarios, and are not suitable for detecting abnormal behaviors in the examination. Lin Yongzheng [44] et al. proposed a cheating detection method based on the dynamic threshold by analyzing the behavioral characteristics of exchanging items. The iterative threshold method was used to determine the dynamic threshold to segment the differential image. The segmentation results were used to guide the update of the background, and completed the detection of cheating in the examination room based on the background subtraction algorithm. However, this method can only detect the abnormal behavior of exchanging items, and the test results were not given in the article. Dai Jinbo [45] and others proposed a method of abnormal behavior detection in the examination room. They proposed concepts, such as a behavior coverage area and 3D examination room attention. They used Latent SVM to build a model, but its accuracy and speed did not have clear advantages. It should also be noted that the detection range is too small to be applied to actual exam scenarios.
This paper draws on the experience of related researchers and uses the YOLOv3 algorithm to establish models. To detect abnormal behavior in the examination, several improvements are made to the YOLOv3 algorithm. First, the loss function is improved. Next, the size of the anchor boxes used in the algorithm is modified. Then, the backbone network is simplified. Finally, the frame-alternate dual-thread method is used for detection to further improve the detection speed and meet the needs of real-time detection.
The rest of this paper is organized as follows. The third part describes the YOLOv3 algorithm, K-Means clustering of bounding boxes, improved backbone Darknet32, improved loss function, and frame-alternate dual-thread principle. The fourth part is the experimental and result analysis. It introduces the experimental environment, data sets, and model evaluation standards, analyzes the detection accuracy by constructing models, then uses four video detection methods to detect the video, and analyzes the detection speed and memory consumption. The last part summarizes the paper and gives relevant conclusions.

Methodology

YOLOv3
The YOLOv3 algorithm has made many improvements in YOLOv1 and YOLOv2, so that both the detection accuracy and detection speed have been significantly improved. Its core idea lies in the realization of "end-to-end" using CNN to complete the entire object detection process. The network structure of YOLOv3 can be divided into a backbone and a head (as shown in Figure 1), which perform feature extraction and multi-scale prediction, respectively. It draws on the feature fusion pyramid idea of Feature Pyramid Networks (FPN) [46], extracts the features of the previous layer through the up-sampling operation, fuses the features of the current layer, and then predicts from three scales. It realizes the detection of objects of different sizes, and has a good detection effect even when the object is partially occluded.
Figure 1. YOLOv3 structure.
In Figure 1, DBL represents a complete convolutional layer, including three operations: Convolution Operation (Conv), Batch Normalization (BN), and Leaky Rectified Linear Unit (Leaky ReLU) activation function. RES_n is a set of residual networks, composed of Zero Padding, DBL, and n residual units (RES Unit). The RES Unit adds the input of the unit to the output of the unit based on the DBL. Concat represents the splicing operation of the features of two different layers.
The backbone Darknet53 of YOLOv3 is similar to ResNet [47]. Residual blocks are added to the network to pass features quickly between certain layers, which alleviates the network degradation problem faced by deep CNNs and enables the network to be built deeper. When Darknet53 is used for object detection, the fully connected (FC) layer is removed, so it contains 52 convolutional layers.
The part outside the dashed box in Figure 1 is the head network, which is used to obtain the location and category of the object. When acquiring the location of the object, YOLOv3 divides the image into S × S grids at three scales (as shown in Figure 2). The grid cell in which the center of the object is located is responsible for predicting the object, and each grid cell contains B bounding boxes with their confidences, and C category probabilities. The anchor mechanism introduced in YOLOv3 sets anchor boxes with different sizes and aspect ratios on three scales, and each grid cell predicts bounding boxes of three different aspect ratios (1:1, 1:2, 2:1). Each bounding box contains four coordinate values (t_x, t_y, t_w, t_h) and one confidence value. To solve the multi-label classification problem, YOLOv3 uses multiple logistic regression classifiers (sigmoid function) instead of the SoftMax function, and uses binary cross entropy loss to calculate the category loss.
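To make the size of the head's output concrete, the following minimal sketch computes the prediction tensor shape at each scale. The 416 × 416 input and the strides 32, 16, and 8 are assumptions (commonly used YOLOv3 settings, not stated in this paper); B = 3 boxes per grid cell follows the text, and C = 4 assumes the four abnormal behaviors analyzed later. The function name is hypothetical.

```python
def head_output_shapes(input_size=416, strides=(32, 16, 8),
                       boxes_per_cell=3, num_classes=4):
    """Return the (S, S, B * (5 + C)) output shape for each detection scale."""
    # Each box carries tx, ty, tw, th, one confidence, and C class scores.
    values_per_box = 4 + 1 + num_classes
    shapes = []
    for stride in strides:
        s = input_size // stride  # grid size S at this scale
        shapes.append((s, s, boxes_per_cell * values_per_box))
    return shapes


shapes = head_output_shapes()
for shape in shapes:
    print(shape)
```

Under these assumptions every grid cell outputs 3 × (5 + 4) = 27 values, and only the grid size S changes between scales.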

Obtaining the Optimal Anchor Boxes
In order to achieve rapid detection of objects of different sizes and aspect ratios, the YOLOv2 algorithm uses fixed-size anchor boxes as references for the boundary of the object. Choosing suitable anchor boxes can improve the detection accuracy of the algorithm. The anchor boxes used in the YOLOv3 algorithm are derived from the 80 categories of the COCO data set (see https://cocodataset.org/#home, 15.01.2021). They are suitable for most detection scenarios, but not fully suitable for abnormal behavior detection in the examination. Therefore, this paper re-selects more suitable anchor boxes to improve the detection accuracy.
The K-Means [48][49][50] algorithm uses distance as the classification criterion: the smaller the distance between two samples, the more similar they are. K-Means generally uses Euclidean distance as the metric, but this is not suitable for clustering bounding boxes, because larger boxes would then produce larger errors than smaller ones. Therefore, this paper uses a distance based on Intersection over Union (IoU). The calculation formula is as follows.
$$d(B, C) = 1 - \mathrm{IoU}(B, C) \tag{1}$$
In the formula, B is the bounding box, C is the cluster center, and IoU(B, C) is the intersection over union of the two rectangular boxes.
The steps to use the K-Means algorithm to get the optimal anchor boxes are as follows.
Step 1: Read the .xml files in the data set to obtain the position information (x_min, y_min, x_max, y_max) of all bounding boxes in the images.
Step 2: Calculate the size of all bounding boxes and normalize them to get the normalized width and height of each bounding box. The calculation method is as follows.
$$w = \frac{x_{max} - x_{min}}{W} \tag{2}$$
$$h = \frac{y_{max} - y_{min}}{H} \tag{3}$$
In Equations (2) and (3), w is the normalized width of the bounding box and h is the normalized height of the bounding box. W is the width of the image and H is the height of the image.
Step 3: Initialize the number of categories and the cluster centers. The number of categories is set manually, and the cluster centers are initialized randomly.
Step 4: Calculate the distance d(B,C) between each bounding box and all cluster centers, and select the nearest cluster center as its category.
Step 5: Use the average of the width and height of all bounding boxes in each category cluster as the category center for the next iteration.
Repeat steps 4 and 5 until the cluster centers of all categories do not change. At this time, the cluster centers are the best anchor boxes.
The choice of the number of anchor boxes is not random. It is determined based on the average distance (mean IoU) of all bounding boxes to their cluster centers. In the experiment, the value of the number of clustering categories lies within the range from 2 to 20, and the mean IoU under each value is obtained, as shown in Figure 3 (the number of categories is the number of anchor boxes).
It can be seen from Figure 3 that the curve flattens once the number of anchor boxes reaches 9. More anchor boxes also mean more parameters, which is unfavorable for real-time detection. Therefore, this paper chooses nine anchor boxes. The sizes of the initial and final anchor boxes for YOLOv3 are given in Table 1.
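The clustering steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the boxes are already normalized (w, h) pairs, uses d(B, C) = 1 - IoU(B, C) with both boxes aligned at a common corner, and runs on synthetic data. The function names, the random seed, and the iteration cap are hypothetical.

```python
import numpy as np


def iou_wh(boxes, centers):
    """IoU between (w, h) boxes and (w, h) centers aligned at a common corner."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0])
             * np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_c = centers[:, 0] * centers[:, 1]
    return inter / (area_b[:, None] + area_c[None, :] - inter)


def kmeans_anchors(boxes, k, seed=0, max_iter=100):
    """Cluster (w, h) boxes with the distance 1 - IoU; return centers and mean IoU."""
    rng = np.random.default_rng(seed)
    # Step 3: pick k random boxes as the initial cluster centers.
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    assign = None
    for _ in range(max_iter):
        # Step 4: assign every box to its nearest center.
        new_assign = (1.0 - iou_wh(boxes, centers)).argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # assignments, and therefore centers, no longer change
        assign = new_assign
        # Step 5: each new center is the mean width and height of its members.
        for j in range(k):
            members = boxes[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    mean_iou = float(iou_wh(boxes, centers).max(axis=1).mean())
    return centers, mean_iou


# Synthetic normalized boxes drawn around two typical sizes.
data_rng = np.random.default_rng(1)
boxes = np.vstack([
    data_rng.normal([0.10, 0.20], 0.01, size=(50, 2)),
    data_rng.normal([0.40, 0.30], 0.01, size=(50, 2)),
])
anchors, mean_iou = kmeans_anchors(boxes, k=2)
print(anchors, mean_iou)
```

Calling `kmeans_anchors` for each k in the range considered and plotting the returned mean IoU would give a curve of the kind used to choose the number of anchor boxes.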

Improved Backbone Darknet32
The Darknet53 used in YOLOv3 has good detection accuracy, but the huge network is complicated and redundant for the detection of abnormal behaviors in the examination. Too many parameters will lead to more complex training, more data requirements, and slower detection speed. In order to improve the detection speed of abnormal behavior in the examination and maintain high detection accuracy, this paper draws on Darknet53 and proposes a new CNN structure called Darknet32. Its network structure is shown in Figure 4.
There are six groups of networks with residual blocks in Darknet32. Compared with the five groups of networks in Darknet53, the number of residual blocks in each group is reduced. In order to balance the effect of feature extraction, an additional set of networks is added. The Multi-Scale Training method is adopted to continuously adjust the size of the input image during the training process, so that the network can better predict images of different scales. In addition, the pooling layer is no longer set in the network, and down-sampling is achieved through convolution.
The workflow of the YOLOv3 algorithm using Darknet32 as the backbone is as follows. First, the image is input into the backbone for feature extraction. After six groups of networks with a total of 12 residual blocks, 31 convolution operations have been performed. The results of feature extraction are output at three scales. Then, the deep and shallow features are merged through up-sampling operations. Finally, the detection objects are predicted at three scales. For the detection results, the soft-NMS algorithm is used to filter the detected bounding boxes.
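The filtering step can be illustrated with a small soft-NMS sketch. It assumes the Gaussian score-decay variant, in which the score of an overlapping box is multiplied by exp(-IoU² / σ); which variant and which parameter values were used here is not stated, so `sigma`, `score_thresh`, and the function names are illustrative assumptions.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of such boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian soft-NMS: decay the scores of overlapping boxes instead of deleting them."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = np.arange(len(boxes))
    keep, kept_scores = [], []
    while remaining.size > 0:
        # Select the remaining box with the highest current score.
        best = remaining[np.argmax(scores[remaining])]
        keep.append(int(best))
        kept_scores.append(float(scores[best]))
        remaining = remaining[remaining != best]
        if remaining.size == 0:
            break
        # Decay the other scores according to their overlap with the selected box.
        overlaps = iou_one_to_many(boxes[best], boxes[remaining])
        scores[remaining] *= np.exp(-(overlaps ** 2) / sigma)
        # Drop boxes whose score has become negligible.
        remaining = remaining[scores[remaining] > score_thresh]
    return keep, kept_scores


demo_boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
demo_scores = [0.9, 0.8, 0.7]
keep, kept_scores = soft_nms(demo_boxes, demo_scores)
print(keep, kept_scores)
```

In this toy example the second box overlaps the first heavily, so its score is reduced and it is ranked after the distant third box rather than being discarded outright.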
The parameters and floating point operations (FLOPs) of the original YOLOv3 algorithm and the YOLOv3 algorithm using Darknet32 as the backbone are calculated, as shown in Figure 5.
As Figure 5 shows, compared with the original YOLOv3 algorithm, the YOLOv3 algorithm using Darknet32 as the backbone reduces the number of parameters and floating point operations (FLOPs) by 41%.

Loss Function
The loss function of the YOLOv3 algorithm is composed of bounding box loss, confidence loss, and classification loss. The bounding box loss of the original YOLOv3 is calculated by the L2 norm, whereas IoU is used during evaluation to determine whether the object is detected. However, IoU does not necessarily increase as the L2 loss decreases, because there is no linear relationship between the two, so the bounding box loss function needs to be optimized. We considered using IoU directly in the loss function, but the calculation of IoU does not consider the non-overlapping area, so it cannot reflect the distance between two rectangular boxes or the way they overlap. When the two boxes do not overlap, IoU = 0 and the gradient is 0, so the loss cannot be optimized. IoU therefore cannot be used directly to calculate the bounding box loss. The Generalized Intersection over Union (GIoU) bounding box optimization method proposed by Rezatofighi [51] et al. also takes the non-overlapping area of the two boxes into account, and its calculation formula is as follows.
$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|C \setminus (A \cup B)|}{|C|} \tag{4}$$
In the formula, A and B are the predicted bounding box and the true bounding box, and C is the smallest enclosing box containing A and B. According to Equation (4), GIoU and IoU are positively correlated. When the two boxes are closer to each other in size and distance, GIoU is closer to IoU. Therefore, GIoU can be used as a measure to calculate the bounding box loss. The GIoU loss calculation formula is shown below.
$$L_{GIoU} = 1 - \mathrm{GIoU} \tag{5}$$
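The behavior of GIoU for non-overlapping boxes can be checked with a short sketch. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) with positive area, and takes the GIoU loss as 1 - GIoU; the function names are illustrative.

```python
def giou(box_a, box_b):
    """Return (IoU, GIoU) for two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Area of the smallest enclosing box C.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou, iou - (c_area - union) / c_area


def giou_loss(box_a, box_b):
    """GIoU loss, taken here as 1 - GIoU."""
    return 1.0 - giou(box_a, box_b)[1]


near = giou((0, 0, 1, 1), (2, 0, 3, 1))   # disjoint boxes, small gap
far = giou((0, 0, 1, 1), (4, 0, 5, 1))    # disjoint boxes, larger gap
print(near, far)
```

Both pairs have IoU = 0, yet GIoU falls from -1/3 to -0.6 as the gap grows, so the loss still distinguishes the two cases and can guide the predicted box toward the target.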
The calculation formula of the improved bounding box loss function is below.
In the formula, $s^2$ is the number of grids, B is the number of bounding boxes detected in each grid, $1_{ij}^{obj} \in \{0, 1\}$ indicates whether the bounding box j of grid i is responsible for predicting this object, and $\hat{w}_i$ and $\hat{h}_i$ are the width and height of the true bounding box.
The confidence loss of the YOLOv3 algorithm is calculated by binary cross entropy loss. In order to solve the problem of unbalanced distribution of positive and negative samples, this paper uses the focal loss [52] to optimize the confidence loss. The formula for optimized confidence loss calculation is as follows.
In the formula, α is the weight coefficient in the focal loss function, γ is the hyperparameter added in the focal loss function, $\lambda_{noobj}$ is the weight coefficient, and $\hat{C}_i$ and $C_i$ are the true and predicted values of confidence, respectively.
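The effect of the focal loss on unbalanced samples can be illustrated with the standard binary focal loss, -α_t (1 - p_t)^γ log(p_t). This is a sketch of that general form only, not the paper's full confidence loss (it omits the grid sums and the λ_noobj weighting), and the default values α = 0.25 and γ = 2 are assumptions rather than settings reported here.

```python
import numpy as np


def focal_bce(pred, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Element-wise binary focal loss for predicted confidences in (0, 1)."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    # p_t is the probability the model assigns to the true label.
    p_t = np.where(target == 1, pred, 1.0 - pred)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) samples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


easy_negative = focal_bce(0.1, 0)   # background predicted with low confidence
hard_negative = focal_bce(0.9, 0)   # background predicted with high confidence
print(float(easy_negative), float(hard_negative))
```

With γ = 0 and α = 0.5 the expression reduces to half the ordinary binary cross entropy, which is a convenient sanity check; with γ > 0, the many easy negative samples contribute far less than the few hard ones.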
The classification loss is still calculated by binary cross entropy loss, and its calculation formula is shown below.
In the formula, $\hat{p}_i(c)$ and $p_i(c)$ are the true value and predicted value of the category, respectively.

Frame-Alternate Dual-Thread Detection Method
This paper uses recorded videos containing a large number of abnormal behaviors in the examination for detection. In an actual examination, however, abnormal behavior occurs infrequently and at unpredictable times. Frame-by-frame detection is relatively time-consuming, while skipping several frames between detections causes many missed detections. Therefore, this paper runs detection on every other frame, that is, frame-alternate detection. Multi-thread technology can improve the efficiency of program operation, but memory consumption grows with the number of threads, which conflicts with the goal of real-time detection on ordinary computers with limited memory. In the experiment, as the number of threads was increased, it was found that once the number of threads exceeds 3, the detection speed no longer improves noticeably while memory consumption becomes very large. Based on these measurements, the dual-thread detection method was adopted. Compared with a single thread, two threads increase the detection speed, and compared with more threads, the memory consumption of two threads is still relatively small. The dual-thread detection method allows the computer to perform two different tasks at the same time, so the detection task is divided into two parts. The main thread reads and outputs the video frame-by-frame, and the sub-thread performs frame-alternate detection and labeling, as shown in Figure 6.

Figure 6. Alternate-frame and dual thread detection process. (Flowchart labels: main thread; sub thread; annotate image; use the bounding box of the previous image.)
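The division of work between the two threads can be sketched with Python's standard `threading` and `queue` modules. This is a simplified illustration, not the authors' implementation: frames are plain Python values, the detector is passed in as a function, and the names `detection_worker` and `run_pipeline` are hypothetical. The main thread reads and outputs frames, while the sub-thread runs detection on every other frame and labels the skipped frames with the boxes of the previous detection.

```python
import queue
import threading


def detection_worker(frames_in, results_out, detect_fn):
    """Sub-thread: detect on even-indexed frames, reuse the last boxes on odd ones."""
    last_boxes = []
    while True:
        item = frames_in.get()
        if item is None:              # sentinel: no more frames
            results_out.put(None)
            break
        index, frame = item
        if index % 2 == 0:            # frame-alternate detection
            last_boxes = detect_fn(frame)
        results_out.put((index, frame, last_boxes))


def run_pipeline(frames, detect_fn):
    """Main thread: read frames, hand them to the sub-thread, collect labeled output."""
    frames_in, results_out = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=detection_worker,
                              args=(frames_in, results_out, detect_fn))
    worker.start()
    output = []
    for index, frame in enumerate(frames):
        frames_in.put((index, frame))
        # Output whatever has already been labeled, without blocking the reader.
        while True:
            try:
                output.append(results_out.get_nowait())
            except queue.Empty:
                break
    frames_in.put(None)
    # Drain the remaining results until the sub-thread signals completion.
    while True:
        item = results_out.get()
        if item is None:
            break
        output.append(item)
    worker.join()
    return output


calls = []


def fake_detect(frame):
    """Stand-in detector that records which frames it was asked to process."""
    calls.append(frame)
    return [("box_from", frame)]


labeled = run_pipeline(["f0", "f1", "f2", "f3", "f4", "f5"], fake_detect)
print(labeled)
```

In CPython, a layout like this speeds things up mainly when the detector spends its time in native code that releases the global interpreter lock, which is an assumption about the deployment rather than something stated in the text.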

Experimental Environment and Data Set
divide the images into the training set and test set randomly. Fourth, use LabelImg software to annotate the images, that is, use rectangular boxes to mark abnormal behaviors in the images, and generate .xml files based on the position and name of the rectangular boxes. Finally, the examination abnormal behavior data set is organized according to the format of the PASCAL VOC data set. A total of 4120 valid images were obtained, including 8973 abnormal behavior annotation boxes. The training set contains 3740 images and 8105 annotation boxes, and the test set contains 380 images and 868 annotation boxes.

Evaluation Index of the Model
This paper evaluates the performance of the algorithm from two aspects: detection accuracy and detection speed. Detection accuracy is evaluated by the average precision (AP) and mean average precision (mAP), and detection speed is evaluated by frames per second (FPS). Since FPS is greatly affected by the performance of the experimental machine, all experiments were carried out on the same machine so that the results are comparable.
The P-R curve uses recall and precision as the horizontal and vertical coordinates. AP is the area enclosed by the P-R curve and the abscissa, and it can be calculated by integration. The calculation formula is as follows.
$$\mathrm{AP} = \int_0^1 P(R)\,dR \tag{9}$$
In the formula, P(R) is the curve function obtained after smoothing the P-R curve, and R is the recall.
mAP represents the mean AP of N categories, and the calculation formula is as follows.
$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i \tag{10}$$
FPS represents the number of frames detected per second, and the calculation formula is as follows.
$$\mathrm{FPS} = \frac{\mathrm{Frames}}{\mathrm{Seconds}} \tag{11}$$
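The three metrics can be computed as in the sketch below. The smoothing of the P-R curve is assumed to be the all-point interpolation commonly used with PASCAL VOC style evaluation (precision made monotonically non-increasing in recall); the exact smoothing used in the experiments is not stated, and the function names are illustrative.

```python
import numpy as np


def average_precision(recall, precision):
    """Area under the smoothed P-R curve; recall must be sorted in ascending order."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Smoothing: each precision becomes the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever the recall value changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


def mean_average_precision(ap_values):
    """Mean of the per-category AP values."""
    return sum(ap_values) / len(ap_values)


def frames_per_second(frames, seconds):
    """Number of frames processed per second."""
    return frames / seconds


ap = average_precision(np.array([0.5, 1.0]), np.array([1.0, 0.5]))
print(ap, mean_average_precision([0.5, 1.0]), frames_per_second(84, 2.0))
```

In the toy example, the smoothed curve has precision 1.0 up to recall 0.5 and 0.5 up to recall 1.0, so the area is 0.5 × 1.0 + 0.5 × 0.5 = 0.75.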

Analysis of Detection Accuracy
Based on the improvements proposed in this article, three models are established. YOLOv3_G is the model built after optimizing the loss function. YOLOv3_G_KM is the model built after optimizing the loss function and using the best anchor boxes. YOLOv3_G_KM_D32, the algorithm proposed in this paper, is the model built after optimizing the loss function and the backbone and using the best anchor boxes. A comparison of these three models with the original YOLOv3 model is shown in Figure 8.
As can be seen from the figure, every improvement raises the overall detection accuracy of the algorithm to some extent. The mAP of the YOLOv3_G_KM_D32 proposed in this paper on the test set reached 88.53%, which is the highest among the models and 5.22% higher than that of the original YOLOv3. In terms of the detection accuracy of individual behaviors, the improved YOLOv3_G_KM_D32 has significantly higher detection accuracy for each type of behavior than the original YOLOv3 algorithm. For the "look around" behavior, which has the worst detection accuracy, every improvement of the algorithm significantly improves its detection accuracy, and the final algorithm increases it by 16%.
It can be concluded that optimizing the loss function of the algorithm, using the K-Means algorithm to cluster the bounding boxes in the data set to obtain the best anchor boxes, and using the improved backbone can all improve the detection accuracy of the algorithm in varying degrees.
Among the four behaviors, the detection accuracy of the three behaviors of "deliver things," "hand under the table," and "bend over the desk" is relatively high, but the detection accuracy of the behavior "look around" is obviously low. There may be two reasons for this phenomenon. One is that the training data set is not large enough. The other is that the amplitude of the "look around" behavior is too small, which is not very different from the normal behavior in the examination, and the other three behaviors have clear changes in the amplitude of the movement.

Analysis of Detection Speed
This experiment uses four methods to perform video detection with the improved YOLOv3_G_KM_D32 algorithm and the original YOLOv3 algorithm: frame-by-frame single-thread, frame-alternate single-thread, frame-by-frame dual-thread, and frame-alternate dual-thread. The results are shown in Figures 9 and 10. Figure 9 shows that the YOLOv3_G_KM_D32 model proposed in this paper reaches a detection speed of 42 FPS in the frame-alternate dual-thread detection mode and 20 FPS in the frame-by-frame single-thread detection mode. In every detection mode, YOLOv3_G_KM_D32 is significantly faster than the original YOLOv3 algorithm. With frame-alternate dual-thread detection, the video detection speed rises to 42 FPS, which meets the requirements of real-time detection.
Memory consumption increases with the number of threads. As shown in Figure 10, the YOLOv3_G_KM_D32 algorithm proposed in this paper consumes 3038 MB of memory in the frame-alternate dual-thread detection mode and only 1969 MB in the frame-by-frame single-thread detection mode. In every detection mode, the memory consumption of YOLOv3_G_KM_D32 is significantly lower than that of the original YOLOv3 algorithm, so the proposed algorithm reduces memory consumption while achieving real-time detection.
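The following sketch illustrates one plausible reading of frame-alternate dual-thread detection; it is not the implementation used in the experiments. It assumes that one thread reads frames while a second thread runs the detector on every other frame and reuses the most recent result for the skipped frames. The names `fake_detect` and `run_pipeline`, the queue size, and the stride of 2 are all hypothetical.

```python
import queue
import threading


def fake_detect(frame):
    # Hypothetical stand-in for a detector forward pass (e.g. YOLOv3 inference).
    return [("box", frame)]


def run_pipeline(frames, detect=fake_detect, stride=2):
    """Return (index, detections) for every frame.

    The detector runs only on frames whose index is a multiple of `stride`;
    the other frames reuse the latest detections (assumed behavior).
    """
    frame_queue = queue.Queue(maxsize=8)  # bounded so the reader cannot run far ahead
    results = []

    def reader():
        # Thread 1: hand frames to the detector thread in order.
        for index, frame in enumerate(frames):
            frame_queue.put((index, frame))
        frame_queue.put(None)  # sentinel: no more frames

    def detector():
        # Thread 2: detect on alternate frames, reuse results in between.
        latest = []
        while True:
            item = frame_queue.get()
            if item is None:
                break
            index, frame = item
            if index % stride == 0:
                latest = detect(frame)
            results.append((index, latest))

    workers = [threading.Thread(target=reader), threading.Thread(target=detector)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

Under these assumptions the detector does roughly half the work per second of video, while frame reading overlaps with inference, which is consistent with the direction of the speed gain reported above.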
As shown in Figures 9 and 10, compared to the original YOLOv3 algorithm, the YOLOv3_G_KM_D32 algorithm proposed in this paper performs better in terms of detection speed and memory consumption. In addition, the frame-alternate dual-thread detection method proposed in this paper greatly improves the detection speed. Although the memory consumption has increased, it is within an acceptable range.
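To put the trade-off into numbers, the short calculation below uses only the figures quoted in the text for YOLOv3_G_KM_D32 (20 and 42 FPS, 1969 and 3038 MB). The variable names are illustrative.

```python
# Figures quoted in the text for YOLOv3_G_KM_D32.
fps_frame_by_frame_single = 20
fps_frame_alternate_dual = 42
mem_frame_by_frame_single_mb = 1969
mem_frame_alternate_dual_mb = 3038

# Relative speed gain of frame-alternate dual-thread detection.
speedup = fps_frame_alternate_dual / fps_frame_by_frame_single

# Absolute and relative memory cost of the faster mode.
extra_memory_mb = mem_frame_alternate_dual_mb - mem_frame_by_frame_single_mb
memory_ratio = mem_frame_alternate_dual_mb / mem_frame_by_frame_single_mb

print(f"speed-up: {speedup:.2f}x")
print(f"extra memory: {extra_memory_mb} MB ({(memory_ratio - 1) * 100:.1f}% more)")
```

The frame-alternate dual-thread mode is therefore about 2.1 times faster than frame-by-frame single-thread detection, at a cost of 1069 MB (about 54%) more memory.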

Performance Comparison of Different Algorithms
Different algorithms were used in the experiment to establish models, and the test results are shown in Table 2. As can be seen from Table 2, the overall performance of the YOLOv3 series of algorithms is better than that of the SSD algorithm. Therefore, YOLOv3 is more suitable for abnormal behavior detection in the examination. In terms of overall detection accuracy, the YOLOv3_G_KM_D32 algorithm proposed in this paper has clear advantages. Its mAP reaches 88.53%, which is 26.09% higher than that of SSD300 and 5.22% higher than that of the original YOLOv3. In terms of the detection accuracy of individual behaviors, only the SSD300 algorithm has a low AP for every behavior; the AP of the other algorithms is relatively high, and YOLOv3_G_KM_D32 has the highest AP for every behavior. For the "look around" behavior in particular, the AP of the YOLOv3_G_KM_D32 algorithm is 77.52%, which is 41.99% higher than that of the SSD300 algorithm and 15.13% higher than that of the original YOLOv3 algorithm. In terms of detection speed, the YOLOv3_G_KM_D32 algorithm reaches 42 FPS, which is slightly lower than that of the SSD300 algorithm but far higher than that of the other algorithms.
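As a small worked example, the reported mAP and the "look around" AP can be combined to estimate how the other three behaviors perform on average. This assumes that mAP is the unweighted mean of the AP values of the four behavior classes, which is the usual definition but is not restated here; the helper name is hypothetical.

```python
def mean_ap_of_remaining(map_percent, known_aps, num_classes):
    """Average AP of the classes whose AP is not listed in `known_aps`.

    Assumes mAP is the unweighted mean of the per-class AP values.
    """
    remaining = num_classes - len(known_aps)
    if remaining <= 0:
        raise ValueError("at least one class must be left unknown")
    return (map_percent * num_classes - sum(known_aps)) / remaining


# Values quoted in the text for YOLOv3_G_KM_D32: mAP 88.53%, "look around" AP 77.52%.
other_three = mean_ap_of_remaining(88.53, [77.52], num_classes=4)
print(f"{other_three:.2f}")  # 92.20
```

Under this assumption, "deliver things," "hand under the table," and "bend over the desk" average roughly 92.2% AP, which matches the observation that "look around" is the behavior holding the overall mAP down. Because the inputs are rounded to two decimals, the result is approximate.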
In general, YOLOv3_G_KM_D32 achieves good results in both detection accuracy and detection speed, and it has significant accuracy and speed advantages in the detection of abnormal behavior in the examination.
Figure 11 shows the test result for only one image, but in the experiment we tested more images and found that the YOLOv3_G_KM_D32 algorithm proposed in this paper has a better detection effect. In particular, when the detected subject is seated in the back row, the YOLOv3_G_KM_D32 algorithm can still detect it accurately, whereas the other algorithms cannot achieve comparably good results.

Conclusions
To address the problem of abnormal behavior detection in the examination, this article adopts an improved YOLOv3 algorithm. Starting from the production of an examination abnormal behavior data set, we optimized the loss function of the algorithm, used the K-Means algorithm to obtain the best anchor boxes, designed a new backbone, Darknet32, and used the frame-alternate dual-thread method to detect the video. Analyzing the detection accuracy (AP and mAP), detection speed (FPS), and memory consumption of the algorithm leads to the following conclusions.
(1) Using GIoUloss and focal loss to optimize the loss function of the YOLOv3 algorithm, and using the K-Means algorithm to cluster the bounding boxes in the data set to obtain the best anchor boxes, improves the algorithm's detection accuracy for abnormal behavior in the examination.
(2) Using the backbone Darknet32 proposed in this paper for abnormal behavior detection in the examination improves the detection speed and reduces memory consumption while maintaining high detection accuracy.
(3) The frame-alternate dual-thread detection method greatly increases the speed of abnormal behavior detection in the examination without consuming a large amount of memory, and it meets the need for real-time detection.
This paper combines theory with practice. The method can be easily integrated with the cameras in the examination room to realize real-time automated invigilation. Weighing detection accuracy against detection speed, this paper proposes the use of the improved YOLOv3 algorithm for abnormal behavior detection in the examination. The improvements to the YOLOv3 algorithm raise both the detection accuracy and the detection speed, which provides a reference for the subsequent development of automated invigilation. However, the experimental data set is not large enough, the examination scenarios are not rich enough, and the categories of abnormal behavior are not detailed enough, so more data should be collected in future research. In addition, the detection accuracy of "look around" is not good enough because the range of movement of this behavior is too small. This issue will be studied in future work to improve the detection accuracy.