Moving object detection is widely used in intelligent video surveillance, traffic monitoring, pedestrian detection, robot navigation, driver assistance, and related applications [1], and many approaches have been proposed. Depending on whether a neural network is involved, existing approaches fall into two categories, which can also be combined: traditional methods [2,3,4,5,6,7] and neural-network-based methods [8,9,10,11,12]. Traditional methods generally exploit the characteristics of moving objects in image sequences, detecting and identifying them with video/image processing algorithms, whereas neural-network-based methods first train a network on a training dataset and then use it to perform detection.
1.1. Related Works
Classic traditional methods include frame difference [2], background subtraction [3], and optical flow [4]. Frame difference extracts a moving object from the change in its position between frames [2]. This type of method is simple and can satisfy real-time requirements. However, the detected regions often contain holes (cavitation), the detection accuracy is low, and the method cannot be applied directly to a moving camera. Background subtraction methods first build a background model based on statistical principles, and then compare the frame under test with that model to segment the moving objects. Typical background models include the Gaussian mixture model (GMM) [5], CodeBook [6], and ViBe [7]. Background subtraction methods are computationally simple, fast, and accurate in static scenes, but they are sensitive to interference such as illumination changes, swaying leaves, and water ripples [1], and are therefore not suitable for moving object detection in dynamic scenes. Optical flow methods use the time-varying optical flow characteristics of moving targets to establish the optical flow constraint equation for moving object detection [4]. Their advantage is that they can detect moving objects without prior knowledge of the scene, which makes them suitable for dynamic scenes. Their shortcoming is a heavy computational burden.
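As a rough illustration of the first two ideas above, the following Python sketch applies frame difference and a running-average background model to a synthetic grayscale sequence. It is not the implementation used in [2] or [3]: the function and class names, the threshold of 25 gray levels, the learning rate of 0.05, and the synthetic frames are all assumptions made for this example, and the running average merely stands in for more elaborate background models such as GMM, CodeBook, or ViBe. Plain Python lists are used to keep the sketch dependency-free; practical code would normally operate on image arrays.

```python
def frame_difference(prev_frame, curr_frame, threshold=25):
    """Mark pixels whose intensity changed by more than `threshold`
    between two consecutive grayscale frames (lists of rows)."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


class RunningAverageBackground:
    """Hypothetical minimal background model: a per-pixel running average."""

    def __init__(self, first_frame, learning_rate=0.05, threshold=25):
        self.background = [[float(v) for v in row] for row in first_frame]
        self.learning_rate = learning_rate
        self.threshold = threshold

    def apply(self, frame):
        """Return the foreground mask for `frame` and update the model."""
        mask = []
        for bg_row, row in zip(self.background, frame):
            mask_row = []
            for x, value in enumerate(row):
                moving = abs(value - bg_row[x]) > self.threshold
                if not moving:
                    # Update only background pixels, so that a moving object
                    # is not absorbed into the model straight away.
                    bg_row[x] += self.learning_rate * (value - bg_row[x])
                mask_row.append(moving)
            mask.append(mask_row)
        return mask


def synthetic_frame(obj_x=None, size=32, obj=8, bg=50, fg=200):
    """Uniform background with an optional bright square at column `obj_x`."""
    frame = [[bg] * size for _ in range(size)]
    if obj_x is not None:
        for y in range(12, 12 + obj):
            for x in range(obj_x, obj_x + obj):
                frame[y][x] = fg
    return frame


def count(mask):
    """Number of pixels flagged as moving."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # The square moves 4 pixels to the right, i.e. less than its own width.
    prev_frame, curr_frame = synthetic_frame(4), synthetic_frame(8)

    fd_mask = frame_difference(prev_frame, curr_frame)

    model = RunningAverageBackground(synthetic_frame())  # empty scene
    bg_mask = model.apply(curr_frame)

    print("frame difference pixels:", count(fd_mask))
    print("background subtraction pixels:", count(bg_mask))
```

In this synthetic case the object moves by less than its own width, so frame difference marks only the leading and trailing strips and leaves the overlapping interior undetected, which is the hole (cavitation) effect mentioned above. The background model, initialized from an empty scene, recovers the whole object, but it relies on that background remaining valid.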
Neural-network-based methods can be divided into two types: two-stage detection methods [8,9] and single-stage detection methods [10,11,12]. Two-stage detection methods consist of two steps: generation of candidate regions, followed by classification and regression of the detected objects. R-CNN [8] and SPP-Net (spatial pyramid pooling network) [9] are classic two-stage detection methods. Because these methods must select candidate regions in the image before classifying and locating the objects, they cannot easily satisfy real-time detection requirements. Single-stage detection methods extract the candidate regions directly and then apply classification and regression to them. YOLOv1 [10] to YOLOv4 [11] are typical single-stage detection methods. These approaches need only one neural network to predict both the object class and the object location, and can thus satisfy real-time detection requirements [12].
Although a great number of moving object detection methods have been reported in the literature, as discussed above [1,2,3,4,5,6,7,8,9,10,11,12], very few of them can be directly applied to an actual video surveillance system. The reason is that such a system, especially when used outdoors or in the wild, is usually equipped with a low-power processor of limited processing capability to allow long-term monitoring; therefore, most moving object detection methods cannot achieve real-time performance on it [13]. Thus, developing a practical and affordable video surveillance system for real-time moving object detection is meaningful and valuable. The well-known video surveillance system W4 was an early attempt to detect and track people in an outdoor environment [14]. In the past decade, some state-of-the-art moving object detection systems have been developed.
Mori et al. [15] presented an FPGA-based omnidirectional vision system that uses a background subtraction algorithm for moving object detection in mobile robotic applications. The detection error was about 24% at a distance of 200 cm, so the system was not suitable for long-distance detection. Wang et al. [16] developed a real-time small moving object detection system based on infrared images. The system uses an FPGA chip and a DSP chip as the main computing elements, and its detection speed can reach 22 fps. Nevertheless, the system is not suitable for all-day operation, especially when the ambient temperature is high, for example when the sun is shining. Moon et al. [17] implemented an SoC system for real-time moving object detection based on a 32-bit ARM922T processor and an FPGA. The SoC system can reach a speed of 15 fps; however, because it detects movement from the difference with the previous image, it is sensitive to illuminance changes on an otherwise unchanged object. Dong et al. [18] designed a moving object tracking system by combining classic object detection and tracking algorithms. Because it ran on a QiTianM4330 desktop computer, it was a high-power-consumption system and was not suitable for monitoring moving objects outdoors or in the wild.
Iqbal et al. [19] presented a quadcopter-based solution, built on R-CNN, that monitors designated premises for unusual activities. However, images captured by the aerial surveillance system must be transmitted to a workstation on the ground for analysis, which degrades real-time performance. Alam et al. [20] proposed a real-time surveillance system using a low-cost drone (UAV), in which the heavy computation tasks were moved to the cloud while limited computation was kept on board the UAV using edge computing techniques. Since the video streams must be transmitted to the cloud, end-to-end delay is introduced. Angelov et al. [21] designed and implemented a moving object detection system, AURORA, mounted on a DJI hexacopter S800. Using on-board processing, the system was able to detect moving objects at a rate of 5 fps while sending only important data to a control station on the ground. Rodriguez-Canosa et al. [22] developed a real-time moving object detection and tracking system, DATMO, which runs on an onboard UAV computer and is based on optical flow. Although the camera recorded images at 30 fps, the moving object detection rate could only reach 5 to 10 fps. Huang et al. [13] presented a visual-inertial drone system, REDBEE, that runs on the Snapdragon Flight board for real-time moving object detection. The major shortcoming of aerial surveillance systems based on UAV platforms is their very high power consumption; for example, a DJI hexacopter S800 consumes more than 720 W, which limits its flight time to less than twenty minutes, so long-term monitoring cannot be achieved.
In summary, existing video surveillance systems have several drawbacks. First, the time of day (day or night) is not considered; most of the video surveillance systems cannot work at night [15,17,18,19,20,21,22]. Second, some video surveillance systems are only suitable for indoor or close-range monitoring [15,17,18]. Third, some systems have a low detection speed of only about 5 to 10 fps, even though their cameras can acquire images at 30 fps or faster [21,22]. Moreover, some of the aerial surveillance systems depend on a workstation on the ground to process video streams, which results in end-to-end delay [19,20]. Finally, the power consumption of the aerial surveillance systems is very high because drones are energy-consuming vehicles [13,19,20,21,22], which makes long-term monitoring impossible.