Wild Animal Information Collection Based on Depthwise Separable Convolution in Software Defined IoT Networks

Abstract: Wild animal information collection based on the wireless sensor network (WSN) has an enormous number of applications, as demonstrated in the literature. Yet it suffers from problems such as low information density and a high energy consumption ratio. The traditional Internet of Things (IoT) system is characterized by limited resources and task specificity. Therefore, we introduce an improved deep neural network (DNN) structure to address task specificity. In addition, we adopt the programmability idea of the software-defined network (SDN) to address the high energy consumption ratio and low information density caused by the low autonomy of equipment. By introducing advanced network structures, such as the attention mechanism, residuals, depthwise (DW) convolution, pointwise (PW) convolution, spatial pyramid pooling (SPP), and feature pyramid networks (FPN), we design a lightweight object detection network with a fast response. Meanwhile, the SDN concepts of control plane and data plane are introduced, and nodes are divided into different types to facilitate intelligent wake-up, thereby realizing high-precision detection and high information density in the detection system. The results show that the proposed scheme improves the detection response speed and reduces the model parameters while maintaining detection accuracy in software-defined IoT networks.
40 m × 40 m and the duration of the video is 10 s. To reflect the advantages of SDN, we set self-organizing network areas of different sizes: 20%, 25%, 30% and 35%. The coverage area only affects the deployment method with the SDN structure because, with the WSN deployment method, the node's transmission power is usually given in advance; it is hardware-dependent and not programmable. In addition, 1% random noise is introduced.


Introduction
Traditional wild animal information collection systems are mainly deployed in the form of a wireless sensor network (WSN), and the most widely used method is the large-scale deployment of infrared cameras [1][2][3]. Infrared camera monitoring has the advantages of simple use and convenient operation [4], but it also suffers from insufficient intelligence. Sensor nodes are often highly hardware-dependent, i.e., the deployment and adjustment of equipment depends on researchers, and nodes cannot be reprogrammed to achieve intelligent control. Additionally, in the complex environment of the field, the objects that we want to detect are often known groups, so we only need to collect information on the specified species [5]. However, due to the complexity of the wild environment and species diversity, false triggering often occurs for a variety of other reasons, causing additional energy consumption in resource-limited cameras and adding workload to the subsequent screening of collected information.
For a long time, collecting as much helpful information as possible has been the focus of wild animal data collection research [6][7][8]. Traditional data collection researchers optimize from the perspective of node deployment and information routing. They adjust the position, height, deployment density and other influencing factors so that the camera captures as much helpful information as possible. At the same time, they adjust the transmit power and receive power of sensor nodes to change the routing of information and reduce the energy consumption ratio [9,10]. These studies focus on information routing and node deployment. The detection method still relies on infrared triggering, and data processing and data forwarding still depend on deployment by researchers. The traditional WSN-based collection mode does not effectively resolve the problem of extra energy consumption caused by false triggering.
With the development of artificial intelligence (AI), a large body of research has begun to be applied and deployed across industries. Therefore, we use a target detection algorithm to collect specific animal information; the target detection network only responds to the target animal, which can effectively reduce the occurrence of false triggering. At the same time, the deployment of traditional wireless sensor network nodes is often hardware-dependent, so their autonomy and self-organizing networking ability are minimal [11]. We learn from the network structure of SDN to enhance the autonomy and dynamic networking capabilities of nodes, and realize a hardware-independent, software-defined IoT system [12]. The sensors in the data layer report their current status information to the control layer for data processing. The control layer makes global decisions and sends instructions to nodes to realize an intelligent, programmable wireless sensor network structure. Figure 1 shows the network structure of the SDN-assisted wild animal information collection system. The data layer is composed of camera nodes, which act as sensors and are responsible for collecting information. When a camera detects the presence of target animals, it sends the status information to the control layer. The control layer comprises programmable devices, which can be a PC, a Raspberry Pi or other controllers with strong information processing ability. It sends instructions to the camera nodes in the data layer to realize the self-organizing network between sensor nodes. The top application layer implements data summary and visualization so that users can use the system conveniently. We focus on building a responsive, lightweight target detection network that is transplanted to sensor nodes in the data layer to realize intelligent data collection.
Electronics 2021, 10, 2091
To achieve AI-driven wild animal information collection, it is necessary to introduce a suitable DNN algorithm. The first task is to design a lightweight object detection network structure that can be transplanted to nodes or embedded devices for specified information collection. With the development of semiconductor chips, the computing speed of computers is getting ever faster, and researchers have started to study neural networks in depth. From ResNet101 [13] to Faster RCNN [14][15][16][17], the number of network parameters keeps increasing, and even traditional portable notebooks can no longer run these networks. As a result, the development of artificial intelligence is gradually becoming limited to theoretical research in the laboratory. Therefore, it is of great value and research significance to design a lightweight network suitable for transplantation to embedded devices or IoT devices [18,19]. In this way, more helpful information can be collected with a lower energy consumption ratio. At the same time, the detection system adopts an SDN-based structure that separates data and control [20], so that it can achieve intelligent detection and data collection [21][22][23]. Based on this, we designed a fast, lightweight network structure for wild animal information collection to realize AI-driven goals in SDN-assisted IoT systems.

Related Works
Neural network research in image processing mainly focuses on classification, object detection and instance segmentation. Considering the specific application scenario of animal information collection in the wild environment, we mainly introduce algorithms related to object detection.
Typically, the high-precision algorithms in object detection are based on a two-stage detection process. The classic RCNN algorithm [16] selects regions of interest (ROIs) on the original image through Selective Search [24]. It then scales the obtained ROIs to a specific size and inputs them to a convolutional neural network (CNN) to extract the features of each proposed region. In this way, a feature vector is obtained for each ROI and fed to a support vector machine (SVM) to get the classification result [25]. Non-Maximum Suppression (NMS) is then used to filter out the appropriate bounding boxes, and finally the position of each bounding box is iteratively corrected to make it more accurate. The algorithm based on selective search selects 2000 ROIs for each image, which brings a massive amount of data to the subsequent CNN feature extraction network. In the actual test, it takes forty seconds to detect an image. Applying the RCNN algorithm to embedded devices is challenging, and transplanting it to IoT devices is harder still.
Fast RCNN [15] is an improved version of the RCNN algorithm. It first performs convolution on the whole input image and obtains the ROI features from the resulting feature map, effectively avoiding the problem that every ROI must be input to the convolutional neural network. However, it still uses selective search to get the ROIs, which is a very time-consuming operation. Therefore, Fast RCNN only alleviates the long inference time and does not effectively solve the problem. Faster RCNN has made significant progress in the optimization of inference time. It uses the region proposal network (RPN) structure instead of selective search and introduces the anchor concept. The ROIs obtained by the RPN are also based on the feature map extracted by the preceding CNN backbone, and detection is based on ROI pooling. In the whole process, the original image only needs to pass through the CNN once, and all operations are performed on the feature map of the original image. The RPN generates proposals based on anchor boxes. When the image is divided into 38 × 38 grids, each grid establishes nine prior boxes, for a total of 12,996 prior boxes. Although the time is improved compared to Fast RCNN, the generation of ROIs still consumes a lot of time and resources. The mainstream two-stage networks introduced above have high accuracy in the object detection task. Still, their parameter counts and response times are not suitable for the actual scenarios of embedded devices, and practical application in resource-limited networks is even more difficult.
The one-stage network has the advantage in response time. Unlike the two-stage method, the one-stage idea is to directly predict and classify the candidate boxes at each image position without generating candidate boxes in advance. At present, the most mainstream one-stage detection algorithms are YOLO [26] and Single Shot MultiBox Detector (SSD) [27,28]. The detection process of YOLO is divided into the following steps. First, the original image is divided into M × M small grids. Each grid generates N bounding boxes, predicts (x, y, w, h) and a confidence score for each bounding box, and gives the confidence of C categories. The output for an image is a vector of size M × M × (5 × N + C). Since YOLO divides the entire image into a grid, there is no need for sliding windows or proposed candidate boxes, and its response time is dramatically improved. Still, the classification accuracy and positioning accuracy drop considerably.
Subsequent improvements to the algorithm include introducing other tricks, such as the residual structure, batch normalization (BN), and data augmentation. The improved YOLOV3 [29] also introduced the feature pyramid networks structure [30] to achieve higher detection accuracy. The one-stage algorithm can now approximately meet the requirements of the task in terms of detection accuracy. Meanwhile, the inference time of the model is also significantly reduced, and detection at dozens of frames per second can be achieved. However, the model parameters and required memory remain a major problem. Practical application is still difficult for embedded facilities in IoT systems and devices with minimal resources.
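The NMS step mentioned above can be illustrated with a minimal sketch. This is the common greedy variant, not code from the RCNN authors; the box format `(x1, y1, x2, y2)`, the function names and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (illustrative values).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with an IoU of 81/119, which exceeds the threshold, so only the first and third boxes survive.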
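The prior-box count quoted above follows directly from the grid size and the number of boxes per grid cell. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def count_prior_boxes(grid_h: int, grid_w: int, boxes_per_cell: int) -> int:
    """Total number of anchor (prior) boxes for a grid_h x grid_w feature map."""
    return grid_h * grid_w * boxes_per_cell


# Values taken from the text: a 38 x 38 grid with nine prior boxes per cell.
total = count_prior_boxes(38, 38, 9)
print(total)  # 12996
```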
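To make the output size concrete, the sketch below evaluates M × M × (5 × N + C) for one setting. The values M = 7, N = 2 and C = 20 are illustrative assumptions (they match the configuration commonly reported for the original YOLO on a 20-class dataset), not numbers taken from this paper.

```python
from typing import Tuple


def yolo_output_shape(m: int, n: int, c: int) -> Tuple[int, int, int]:
    """Shape of the YOLO prediction tensor: an M x M grid, N boxes with
    (x, y, w, h, confidence) each, plus C class scores per grid cell."""
    return (m, m, 5 * n + c)


shape = yolo_output_shape(m=7, n=2, c=20)   # assumed example values
num_values = shape[0] * shape[1] * shape[2]
print(shape, num_values)  # (7, 7, 30) 1470
```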
In recent years, some researchers have begun to deploy lightweight networks on mobile devices. The two most popular lightweight network structures are MobileNet [31][32][33] and ShuffleNet [34,35]. Some researchers have implemented a more lightweight network design in the direction of model compression [36]. The Google team first proposed the MobileNet network based on depthwise convolution in 2017. By analyzing the structure and parameters of classic networks, researchers found that the parameters of a model are mainly concentrated in the convolution operations. The traditional convolution is shown in Figure 2. When performing feature extraction and dimension changes, the depth of each convolution kernel matches the depth of the feature map of the previous layer, which facilitates the fusion of features of different dimensions, so that global components extracted by the different convolution kernels of the previous layer can be obtained. The ensuing problem is the considerable number of parameters. MobileNet uses depthwise convolution, as shown in Figure 3. When performing feature extraction and dimensional transformation, each convolution kernel corresponds only to the feature map of one specific dimension, so that the depth information of the feature map is separated. However, extracting feature information in only a single dimension ignores the features of the other dimensions and cannot effectively extract global features. To solve this problem, after the DW operation, MobileNet uses the PW convolution with a 1 × 1 convolution kernel to increase the dimension, and the feature information of the single dimensions is integrated through the PW operation. In the improved MobileNetV2 network, the authors analysed the feature information transmitted in the low-dimensional space and found that information loss and image distortion are prone to occur there, so the expansion coefficient is introduced. Before MobileNetV2 performs DW, it first expands the feature dimension by a factor of six to alleviate the distortion of low-dimensional features, and finally uses PW convolution for global feature fusion.
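The division of labour between the DW and PW steps can be shown with a minimal pure-Python sketch on nested lists. It is not the MobileNet implementation: it uses no padding, stride 1, no bias and no activation, and the function names and toy values are assumptions for illustration only.

```python
from typing import List

Matrix = List[List[float]]   # one channel: H x W
Tensor = List[Matrix]        # feature map: C x H x W


def conv2d_single(channel: Matrix, kernel: Matrix) -> Matrix:
    """Valid (no padding), stride-1 2D correlation of one channel with one kernel."""
    k = len(kernel)
    out_h = len(channel) - k + 1
    out_w = len(channel[0]) - k + 1
    return [
        [
            sum(channel[i + di][j + dj] * kernel[di][dj]
                for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def depthwise(x: Tensor, kernels: List[Matrix]) -> Tensor:
    """DW step: kernel c is applied to input channel c only, so channels never mix."""
    return [conv2d_single(ch, k) for ch, k in zip(x, kernels)]


def pointwise(x: Tensor, weights: List[List[float]]) -> Tensor:
    """PW step: a 1 x 1 convolution; weights[n][c] mixes input channel c into output channel n."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wn[c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
         for i in range(h)]
        for wn in weights
    ]


# Two 3 x 3 input channels, one 2 x 2 kernel per channel, then a 1 x 1 mix to one output channel.
x = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
]
dw_kernels = [
    [[1, 0], [0, 1]],   # sums the main diagonal of each 2 x 2 window
    [[1, 1], [1, 1]],   # sums every element of each 2 x 2 window
]
pw_weights = [[1.0, 0.5]]  # a single output channel

dw_out = depthwise(x, dw_kernels)
y = pointwise(dw_out, pw_weights)
```

Here `dw_out` still has two independent channels, and only the `pointwise` call combines them, which is the role the text assigns to the PW operation.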
ShuffleNet also uses a DW convolution operation. However, it does not use point-by-point PW convolution when acquiring global features. Instead, it uses group convolution, performing the PW convolution within groups and exchanging information between groups through channel shuffling. An approximate global feature is obtained in this way, and the parameters are significantly reduced. In the current mainstream lightweight networks, the depthwise convolution is used as the basic block, and different algorithms use different methods for global feature fusion. The lightweight network based on depthwise convolution solves the problem of an excessive number of parameters. As a price, this type of model sacrifices a certain degree of accuracy. This new convolution method makes it possible to transplant deep learning models to embedded devices, thus realizing AI-assisted intelligent information collection.
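The channel shuffling step can be sketched on a flat list of channel labels. This is a minimal illustration of the reshape-transpose-flatten idea usually used to describe channel shuffle, not ShuffleNet's actual code; the function name and the six-channel example are assumptions.

```python
from typing import List, TypeVar

T = TypeVar("T")


def channel_shuffle(channels: List[T], groups: int) -> List[T]:
    """Interleave channels across groups so the next group convolution sees every group."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # View the list as (groups, per_group), transpose to (per_group, groups), then flatten.
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle(list(range(6)), groups=2)
print(shuffled)  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, each consecutive block of channels contains members of both original groups, which is how grouped 1 × 1 convolutions can approximate global feature fusion.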
In the actual wild animal information collection scene, the complexity of the environment is mainly reflected in the severe occlusion between the target animal and the background environment. Secondly, due to the physiological characteristics of animals, their external characteristics, such as hair, shape, size and colour, often have substantial similarities with their living environment. In addition, animals tend to move quickly, and the time available to capture information is relatively short. This paper proposes an optimized network structure based on depthwise separable convolution. We suggest that the network structure is better than the traditional two-stage network in terms of parameters and better than the conventional one-stage fast detection network in detection accuracy. At the same time, the introduction of the SDN control plane helps build a dynamic self-organizing IoT system. The focus of our research is the construction of a lightweight target detection network suitable for sensor nodes. The contributions of this article are summarized as follows:
1. We propose a fast-response lightweight network model that can be transplanted to embedded devices, such as ARM series development boards and Raspberry Pi development boards. This network effectively addresses the problems of high energy consumption ratio and low information density in traditional wild animal information collection. For the backbone network, we use MobileNetV2 [32]. The block design based on depthwise separable convolution significantly reduces the parameters. In the Neck part, we use a simplified spatial pyramid pooling (SPP) structure built on depthwise separable convolution [17], and the feature fusion module PANET [37] is replaced with an improved FPN structure [30] to achieve a further reduction of parameters and effective feature fusion.
2. We use the public Oregon Wildlife dataset. The data is collected in the wild natural environment, effectively reflecting the model's performance in the actual environment. We carefully analysed the dataset and selected five animals that are more difficult to detect for training, including black bears with a single feature, wild ocelots that are very similar to the background environment, fast-moving elk, dangerous and aggressive grey wolves, and nocturnal raccoons. Experimental results show that our network has a high recall rate, a high precision rate and high confidence in actual complex scenes.
The article is organized as follows: The third part introduces the materials and methods in detail, including the network structure, the composition of the dataset, and the definition of the loss function. The fourth and fifth parts present the results and discussion, in which we compare the proposed network with mainstream one-stage and lightweight networks and present model analysis and display results. In the final section, we summarize the application value and research significance of the network structure proposed in this paper.

Materials and Methods
One of the advantages of the SDN structure is its programmable network structure. Researchers have designed many schemes based on the SDN structure that are superior to traditional algorithms [38][39][40]. Specifically, in the animal information collection system, we can pay more attention to the data collection task of the data layer, without too much consideration of how to forward and further process the data collected by sensors. Intelligent routing and self-organizing networking of sensor nodes in the data layer can be realized by programming in the control layer. When we deploy sensor nodes, there is no need to pre-program the hardware. As the field environment changes, the status of the sensor nodes used to collect information changes accordingly. When the active node detects the specified animal, the control layer processes the result, and the corresponding surrounding nodes are activated and enter the high-definition information recording mode. This differs from WSNs, in which the relationships between nodes are pre-set and nodes do not have the capability of intelligent networking. When no target animal appears, the nodes enter the hibernation mode, and only the active node performs the detection task. With the help of SDN's separation of control and data, we focus below on the construction of the target detection network. Figure 4 shows the basic module of the backbone network, which consists of three parts: a PW operation that raises the dimension of the feature information, the depthwise separable convolution DW, and a PW operation that reduces the dimension of the convolution features. At the same time, the shortcut of the classic ResNet is retained.
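The wake-up behaviour described above can be sketched as a small control-layer routine. This is a hypothetical illustration only: the class and method names, the three node modes and the distance-based definition of "surrounding nodes" (a `wake_radius` parameter) are assumptions, not details given in the paper.

```python
from dataclasses import dataclass
from enum import Enum
from math import hypot
from typing import Dict, List


class Mode(Enum):
    HIBERNATE = "hibernate"
    ACTIVE = "active"              # runs the lightweight detector
    HIGH_DEFINITION = "hd"         # records high-definition information


@dataclass
class Node:
    node_id: int
    x: float
    y: float
    mode: Mode = Mode.HIBERNATE


class Controller:
    """Hypothetical control-layer logic; wake_radius is an assumed parameter."""

    def __init__(self, nodes: List[Node], wake_radius: float) -> None:
        self.nodes: Dict[int, Node] = {n.node_id: n for n in nodes}
        self.wake_radius = wake_radius

    def on_detection(self, reporter_id: int) -> List[int]:
        """Handle a 'target detected' report: wake the reporter's neighbours into HD mode."""
        src = self.nodes[reporter_id]
        woken = []
        for n in self.nodes.values():
            if n.node_id == reporter_id:
                continue
            if hypot(n.x - src.x, n.y - src.y) <= self.wake_radius:
                n.mode = Mode.HIGH_DEFINITION
                woken.append(n.node_id)
        return sorted(woken)

    def on_target_lost(self, active_ids: List[int]) -> None:
        """Return the designated active nodes to detection and all others to hibernation."""
        for n in self.nodes.values():
            n.mode = Mode.ACTIVE if n.node_id in active_ids else Mode.HIBERNATE


# Three nodes: node 0 is the active detector, node 1 is nearby, node 2 is far away.
nodes = [Node(0, 0.0, 0.0, Mode.ACTIVE), Node(1, 5.0, 0.0), Node(2, 30.0, 30.0)]
ctrl = Controller(nodes, wake_radius=10.0)
woken = ctrl.on_detection(0)
```

Because the decision lives in the controller rather than in node firmware, the wake-up rule can be changed without touching the deployed hardware, which is the programmability advantage the text attributes to SDN.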

Network Structure Design
One of the advantages of SDN structure is the programmable network structure. Many researchers have designed many schemes superior to traditional algorithms based on SDN structure [38][39][40]. Specifically, in the animal information collection system, we can pay more attention to the data collection task of the data layer, without too much consideration on how to forward and further process the data collected by sensors. Intelligent routing and self-organizing network of sensor nodes in the data layer can be realized by digital programming in the control layer. When we deploy sensor nodes, there is no need to pre-program the hardware in advance. With the change of the field environment, the status of sensor nodes used to collect information also changed accordingly. When the active node detects the specified animal, the control layer will process the result, and the corresponding surrounding nodes will be activated and enter the HIGH-DEFINITION information recording mode. Unlike WSN networks, the relationships between nodes are pre-set and do not have the capability of intelligent networking. When no target animal appears, the node enters the hibernation mode, and only the active node performs the detection task. With the help of SDN's control and data separation advantages, we will focus on the construction of the AI algorithm target detection network below. Figure 4 shows the basic module of the backbone network, which consists of three parts: PW operation that promotes the dimension of feature information, depthwise separable convolution DW, and PW that reduces the dimension of convolution features. At the same time, the short-cut of classic ResNet is retained. The most important part is the target detection network deployed at the data layer nodes. Our proposed network structure designed for information collection of wild The most important part is the target detection network deployed at the data layer nodes. 
Our proposed network structure designed for information collection of wild animals is named OurNet, shown in Figure 5. The network structure we created is divided into four parts, the backbone network, SPP as the additional module of the neck, FPN as the feature fusion module and the head structure based on YOLOV3 [30]. The backbone network comprises three primary modules: standard convolution modules, separated convolution with stride = 1, and separated convolution with stride = 2. In the original feature extraction layer, we still use the standard convolution structure to preserve the original image's feature information. Using DW convolution in the first layer of feature extraction will not effectively extract the feature's spatial information while ignoring the relevance of elements in different dimensions. Then, we send the feature map to the DW module. Before performing the DW convolutions, we first use 1 × 1 convolution to map the feature map to a high-dimensional space, which can alleviate the feature loss problem caused by the Relu function in the depth separation. After the feature map is mapped to the high-dimensional space, we perform depthwise separable convolution, and the convolution kernel size is 3 × 3 in each. The linear combination of the 3 × 3 convolution kernel designs obtains a larger receptive field and obtains better nonlinearity. At the same time, the design of the unified convolution kernel also provides convenience for the hardware threshold circuit design of the algorithm or the design of FPGA [41][42][43]. Then, we perform feature fusion with the use of 1 × 1 convolution again to achieve dimensionality reduction. animals is named OurNet, shown in Figure 5. The network structure we created is divided into four parts, the backbone network, SPP as the additional module of the neck, FPN as the feature fusion module and the head structure based on YOLOV3 [30]. 
After the PW-DW-PW operation, the parameters are reduced to 1/8 to 1/9 of the original. Assume that the size of the feature map is $D_H \times D_W$, the size of the convolution kernel is $D_K \times D_K$, the input feature dimension is $M$, and the output feature dimension is $N$. The parameter quantity of the standard convolution is $D_K \times D_K \times M \times N$, and its calculation quantity is $D_K \times D_K \times M \times N \times D_H \times D_W$. If PW-DW-PW convolution is used, the parameters and calculation amount are reduced to $D_K \times D_K \times M + M \times N$ and $(D_K \times D_K \times M + M \times N) \times D_H \times D_W$, respectively. Relative to the standard convolution, this is a ratio of $1/N + 1/D_K^2$, which is slightly more than 1/9 for $D_K = 3$.
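The parameter comparison above can be checked with a few lines of arithmetic. The sketch below is only an illustration of the two counting formulas: the function names are introduced for this example, and the channel sizes M = 256 and N = 512 are assumed values chosen for convenience, not a layer configuration reported for OurNet.

```python
def standard_conv_params(k, m, n):
    """Weights of a standard k x k convolution with m input and n output channels."""
    return k * k * m * n


def separable_conv_params(k, m, n):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * m + m * n


# Assumed example sizes (hypothetical layer, biases ignored).
k, m, n = 3, 256, 512

std = standard_conv_params(k, m, n)
sep = separable_conv_params(k, m, n)
ratio = sep / std  # equals 1/n + 1/k**2

print(std, sep, round(ratio, 4))
```

For a 3 × 3 kernel the ratio is 1/9 plus a small term 1/N, so it falls between 1/9 and 1/8 whenever N is larger than 72. Multiplying either count by the feature map size gives the corresponding calculation quantity, so the same ratio applies to computation.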

Network Structure Design
The 1 × 1 convolution here is used to alleviate the lack of information exchange between feature dimensions caused by DW convolution. In addition, batch normalization ensures the smooth transmission of gradient information while accelerating convergence [44]. Different from the original MobileNet inverted residual design, in the comparison network we designed, the result of the shortcut operation (CONCAT) is not passed directly to the next convolution layer. Instead, a point attention mechanism re-enhances the features without changing their dimensions while retaining the original features.
To ensure the accuracy of the target detection network, we need additional feature fusion operations. There are four main methods of feature fusion and enhancement. The traditional image pyramid, shown in Figure 6a, extracts features and makes predictions separately at each image size, which significantly increases the inference time and the number of parameters of the model. The single-output feature pyramid, shown in Figure 6b, predicts only on the final feature map; after multiple convolution and pooling operations, the information of small targets is lost. Although the parameters are fewer and the inference time is shorter, the accuracy is lower. The structure in Figure 6c predicts at multiple sizes and uses feature maps instead of input images, so the parameters are reduced, and predicting at multiple sizes and outputs makes it less likely that features are missed. The structure in Figure 6d focuses on the correlation between features of different sizes and achieves higher detection accuracy through multi-size feature fusion, at the cost of a slight increase in the number of parameters.
The original version of YOLO V4 uses PANet based on standard convolution as its feature fusion module [45] and selects 1024-dimensional, 512-dimensional, and 256-dimensional feature maps as input. Every operation on high-dimensional features brings a large increase in parameters. We use convolutional visualization to prune the network structure from the perspective of feature maps. In the wild animal information collection scene, we can obtain good detection results using only the 512-dimensional and 256-dimensional features, and we continue to use PW-DW-PW convolution instead of standard convolution to reduce the parameters further. The feature fusion part uses the one-way FPN structure shown in Figure 6d to reuse high-dimensional features and detect with multiple receptive fields, which ensures detection accuracy.
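The one-way (top-down) fusion idea can be sketched on toy data. The example below is not the OurNet implementation: feature maps are plain nested lists in [channel][row][column] order, the two-channel and one-channel maps merely stand in for the 512-dimensional and 256-dimensional features, fusion by channel concatenation is an assumption, and the 1 × 1 and PW-DW-PW convolutions that would surround the fusion step are omitted.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] nested-list feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(2)]  # repeat each column
            rows.append(wide)
            rows.append(list(wide))                             # repeat each row
        out.append(rows)
    return out


def concat_channels(a, b):
    """Concatenate two feature maps with equal height and width along the channel axis."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("spatial sizes must match before fusion")
    return a + b


def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


# Toy stand-ins: a deep, coarse map (2 channels, 2x2) and a shallow, fine map (1 channel, 4x4).
deep = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
shallow = [[[0] * 4 for _ in range(4)]]

fused = concat_channels(upsample2x(deep), shallow)
print(shape(fused))
```

The deep map is brought to the resolution of the shallow map and then merged with it, so the finer detection scale also sees the high-level features, which is the reuse of high-dimensional features described above.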

Dataset Description
All the data in the Oregon Wildlife Dataset were collected in real wild environments and are primary material for wildlife researchers. The data set contains 14,013 images of 20 types of animals living in various complex environments, including crows, bald eagles flying in the sky, bobcats and mountain raccoons in the snow, red foxes living in the grassland, cougars, and other rare animals. The data are not labelled. We use professional image labelling tools to convert the original images into a format suitable for the object detection network, and select the five most challenging animal categories for training to evaluate the performance of the network structure proposed in this paper. During training, we perform data augmentation, including random occlusion, size scaling, cropping, and stitching, to simulate the harsh natural environment as closely as possible.
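As an illustration of one of the augmentations mentioned above, the following sketch applies random occlusion to a small grayscale image stored as a list of rows. It is a minimal example and not the training pipeline used in this paper: the function name, the `max_fraction` parameter, and the fill value are assumptions made for the example.

```python
import random


def random_occlusion(image, max_fraction=0.5, fill=0, rng=None):
    """Return a copy of a 2D image (list of rows) with one random rectangle set to `fill`."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    # Rectangle size is at most `max_fraction` of each side (at least one pixel).
    occ_h = rng.randint(1, max(1, int(height * max_fraction)))
    occ_w = rng.randint(1, max(1, int(width * max_fraction)))
    top = rng.randint(0, height - occ_h)
    left = rng.randint(0, width - occ_w)
    out = [row[:] for row in image]  # leave the input image untouched
    for r in range(top, top + occ_h):
        for c in range(left, left + occ_w):
            out[r][c] = fill
    return out


image = [[255] * 8 for _ in range(6)]  # toy 6 x 8 image
occluded = random_occlusion(image, rng=random.Random(0))
hidden = sum(value == 0 for row in occluded for value in row)
print(hidden)
```

Because occlusion only hides pixels, the bounding box labels of the image stay valid, whereas scaling, cropping, and stitching would also require the box coordinates to be transformed.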

Definition of Loss Function
The loss function is very important in deep learning. It measures the difference between the forward calculation result of each iteration of the neural network and the real value, so as to guide the next training step in the right direction. The loss function of the network model proposed in this paper consists of three parts: the regression box loss, the confidence loss, and the classification loss, as in Equation (1). The variable K indicates that the original image is divided into K × K grid cells, and each grid cell generates M bounding boxes.
$$\mathrm{Loss} = \mathrm{CiouError} + \mathrm{conf\_loss} + \mathrm{class\_loss} \qquad (1)$$
The position of the regression box is determined by x, y, w, and h, where x and y are the starting coordinates of the box, and w and h are the width and height of the prediction box. When calculating the loss function of the regression box, YOLO V3 corrects the position of the regression box generated by tiny targets by adding a weight coefficient [29] and calculates the loss with the cross-entropy function. We instead use the Ciou loss definition, Equation (2).
When Intersection over Union (iou) is used to calculate the loss function, only the case where the prediction box intersects the real box is considered. When there is no intersection between the two, the iou is zero, the loss is constant, and no gradient information is returned. In addition, the iou loss focuses only on the size of the intersection area and ignores the positional relationship between the prediction box and the real box. The Ciou loss that we use fully considers the positional correlation and size similarity between the regression box and the label box. We introduce the penalty term $R_{Ciou}$, defined in Equation (3).
In the above formula, $b$ and $b^{gt}$ represent the centre points of the prediction box and the real box, respectively, $w^{gt}$ and $h^{gt}$ are the width and height of the real box, and $\rho$ denotes the Euclidean distance between the two centre points. The diagonal length of the smallest enclosing area that contains both the prediction box and the ground truth box is denoted by $c$, the weight function is represented by $\alpha$, and $\nu$ measures the similarity of the aspect ratio, as in Equation (4).
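As a hedged illustration of the quantities described above, the sketch below computes the iou and a Ciou loss for two axis-aligned boxes. It assumes the commonly published Ciou form, $1 - iou + \rho^2(b, b^{gt})/c^2 + \alpha\nu$ with $\nu = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ and $\alpha = \nu / ((1 - iou) + \nu)$; the equations of this paper are not reproduced here and may differ in detail. The (cx, cy, w, h) box format and the function names are assumptions made for the example.

```python
import math


def corners(box):
    """Convert an assumed (cx, cy, w, h) box to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union


def ciou_loss(pred, gt):
    overlap = iou(pred, gt)
    # rho^2: squared Euclidean distance between the two centre points.
    rho2 = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2
    # c^2: squared diagonal of the smallest box enclosing both boxes.
    px1, py1, px2, py2 = corners(pred)
    gx1, gy1, gx2, gy2 = corners(gt)
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2
    # nu: aspect-ratio consistency term; alpha: its weight.
    nu = (4 / math.pi ** 2) * (math.atan(gt[2] / gt[3]) - math.atan(pred[2] / pred[3])) ** 2
    alpha = nu / ((1 - overlap) + nu) if nu > 0 else 0.0
    return 1 - overlap + rho2 / c2 + alpha * nu


truth = (0.0, 0.0, 2.0, 2.0)
near = (5.0, 0.0, 2.0, 2.0)   # no overlap with truth
far = (10.0, 0.0, 2.0, 2.0)   # no overlap, further away
print(iou(truth, near), ciou_loss(truth, near), ciou_loss(truth, far))
```

Both `near` and `far` have an iou of zero with `truth`, so a pure iou loss cannot distinguish them, while the centre-distance term gives the more distant box a larger Ciou loss and therefore a useful training signal.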
The Ciou loss provides adequate direction information for the position movement and size adjustment of the prediction box, and the multi-angle constraint also speeds up convergence. The confidence loss in Equation (5) and the classification loss in Equation (6) are both defined by the binary cross entropy function. The parameter $I_{ij}^{obj}$ indicates whether the j-th anchor box of the i-th grid cell is responsible for detecting the object; its value is 1 or 0, and the same is true for the parameter $I_{ij}^{noobj}$. The parameter $C_i^j$ is the confidence, whose value is determined by whether the bounding box of the grid cell is responsible for predicting an object. $p_i(c)$ is the classification probability of the object in the regression box. Adding the three losses above gives the global loss function of the network, as summarized in Equation (1).

Results
When training the model, we use transfer learning for the parts of the network structure that can use a pre-trained model [46,47], and load a model pre-trained on VOC2007 to speed up convergence. At the same time, transfer learning helps to avoid, as far as possible, convergence to a local optimum with unsatisfactory performance.
We compare the network structure proposed in this article with the advanced lightweight MobileNet series and the one-stage fast-response network YOLO V4. The detection accuracy is shown in Figure 7, which gives the performance indicators of the different models: the bar graph represents the AP value of each wild animal category, and the line graph represents the mAP value of each model. The results indicate that the detection accuracy of our proposed network structure is far superior to that of the traditional lightweight networks. The mAP of our network is as high as 89.52%, while the mAP of MobileNet V2 is only 77.9%. Compared with the one-stage YOLO network, our detection accuracy is also better than that of YOLO V4 Tiny, whose mAP is 77.33%. We also added an attention mechanism to the baseline network. The results show that an increase of 11.6% in the parameters yields an increase of only 2% in accuracy. Considering the balance between parameters and accuracy, the attention module we designed is unsuitable for our backbone network.
We deliberately selected some photos that are difficult to detect, three from each category of animals, to show the detection performance of the network structure proposed in this article. As shown in Figure 8, the network detects the targets well, especially in real complex scenes with severe occlusion, blurring caused by high-speed motion, and targets that closely resemble the background.
The above target animal detection results demonstrate that the proposed network has good detection capabilities. Although our main work is to design a lightweight target detection network suitable for deployment on nodes, the structure of the system refers to SDN, so we also carried out a simulation and present the resulting reduction in energy consumption with the aid of SDN.
Figure 9a shows the node working status of a traditional WSN based on area coverage. Red nodes are in the active working state, blue nodes are in a dormant state, and the circle represents the transmitting power coverage range of the central node. When the target animal moves from node A to node B, the node is awakened. Figure 9b shows the operational status of nodes in the SDN-assisted system. Nodes in the data layer have self-organizing network capability within the range of 20%, and the dashed circle represents the signal coverage range of nodes under the control of SDN. It is apparent that fewer redundant nodes are awakened in the SDN-assisted system, which reduces energy consumption.
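The difference between the two wake-up strategies can be illustrated with a small sketch. It is not the simulation used in this paper: the node coordinates, the fixed radius, and the controller-chosen radius are hypothetical values, and `nodes_to_wake` is a name introduced only for this example. The idea is simply that a programmable, smaller wake-up range activates fewer neighbours of the detecting node than a fixed, hardware-defined range.

```python
import math


def nodes_to_wake(nodes, detector, radius):
    """Return the names of the other nodes within `radius` metres of the detecting node."""
    dx, dy = nodes[detector]
    return sorted(
        name
        for name, (x, y) in nodes.items()
        if name != detector and math.hypot(x - dx, y - dy) <= radius
    )


# Hypothetical node positions in metres (not taken from the paper).
nodes = {"A": (0, 0), "B": (30, 0), "C": (0, 45), "D": (80, 10), "E": (120, 90)}

fixed_radius = 100.0  # assumed range fixed by the hardware in a traditional WSN
sdn_radius = 50.0     # assumed range selected by the SDN control layer

wsn_awake = nodes_to_wake(nodes, "A", fixed_radius)
sdn_awake = nodes_to_wake(nodes, "A", sdn_radius)
print(wsn_awake, sdn_awake)
```

With these assumed positions, node D is woken only under the fixed range, which corresponds to the redundant activation that the SDN-assisted system is meant to avoid.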
To present the results clearly, Table 1 shows the simulation results of the different deployment modes over one day. In the simulation design, we use the energy consumption parameters of infrared cameras, which are widely used in the animal monitoring field. The camera has two operating states: the operating current in HD mode is 650 mA, of which 500 mA is the additional current consumed when the infrared LED is on, and the operating current in the sleep state is 250 µA. The power to forward information is 300 mW, and the operating voltage is 12 V. In a 2 km × 2 km rectangular area, we deployed 30 nodes to perform information collection. To achieve multi-angle collection of animals without the assistance of SDN, the response area of the trigger signal is defined as a pre-defined area of 40 m × 40 m, and the duration of the video is 10 s. To reflect the advantages of SDN, we set self-organizing network areas of different sizes: 20%, 25%, 30% and 35%. The coverage area only affects the deployment method with the SDN structure, because in the WSN deployment method the transmission power of a node is usually given in advance, which is hardware-dependent and not programmable. At the same time, 1% random noise is introduced.
When the nodes do not have self-organizing network capability, their transmitting power is fixed and the pattern of node activation is also fixed, so the energy consumption is always high. With the SDN structure, in contrast, the effective trigger energy consumption within a day is reduced by about 15%. Meanwhile, in a smaller coverage area the nodes have more autonomy and are not limited by pre-set regions, so the reduction in energy consumption is greater.
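To make the camera parameters above concrete, the following sketch converts them into energy figures. It is a back-of-the-envelope calculation rather than the simulation behind Table 1: it uses only the stated 12 V supply, 650 mA HD current, 250 µA sleep current, and 10 s video duration, ignores the 300 mW forwarding power because no forwarding time is given, and the trigger and node counts in the last lines are hypothetical.

```python
VOLTAGE_V = 12.0           # operating voltage
HD_CURRENT_A = 0.650       # HD recording mode, infrared LED on
SLEEP_CURRENT_A = 250e-6   # sleep state
VIDEO_SECONDS = 10.0       # duration of one triggered video
SECONDS_PER_DAY = 24 * 60 * 60


def recording_energy_j(seconds=VIDEO_SECONDS):
    """Energy in joules used by one node while recording in HD mode."""
    return VOLTAGE_V * HD_CURRENT_A * seconds


def sleep_energy_j(seconds=SECONDS_PER_DAY):
    """Energy in joules used by one node that stays asleep for `seconds`."""
    return VOLTAGE_V * SLEEP_CURRENT_A * seconds


def trigger_energy_j(triggers, nodes_awake_per_trigger):
    """Recording energy for a number of triggers, each waking the same number of nodes."""
    return triggers * nodes_awake_per_trigger * recording_energy_j()


# Hypothetical day with 20 triggers, waking 3 nodes (fixed range) or 2 nodes (smaller range).
fixed_range_j = trigger_energy_j(20, 3)
smaller_range_j = trigger_energy_j(20, 2)
print(recording_energy_j(), sleep_energy_j(), fixed_range_j, smaller_range_j)
```

One 10 s recording costs about 78 J per node, whereas a full day asleep costs about 259 J, so every redundant node woken by a trigger adds a noticeable share of a node's daily sleep budget.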

Discussion
The network model parameters are shown in Table 2. The network proposed in this article has nearly 40% fewer parameters than MobileNet V2, which is close to the parameter count of the ultra-lightweight network YOLO V4 Tiny. The main reason is that we use depthwise separable convolution in the backbone network to reduce parameters. At the same time, we fully consider the actual scene: we construct the SPP using 5 × 5 and 7 × 7 when extracting features and use a one-way FPN structure to enhance features, so as to retain precious animal spatial feature information.
The detection time for a single image is shown in Figure 10 (tested on the training machine, an RTX 2060). In terms of detection time, we have reached the level of the advanced YOLO V4. On the RTX 2060, the backbone network inference takes only 5 ms. Compared with traditional cameras, even if the network we propose is ported to embedded devices such as ARM series development boards, it still has solid real-time detection capabilities.
From the point of view of detection time, YOLO V4 Tiny has advantages in real-time processing: it requires only 2.5 ms of inference time and can complete a target detection operation quickly. However, we found that the mAP of YOLO V4 Tiny is only 77.33%, which is much lower than the 89.52% of OurNet. In the special situation of animal information collection in the wild, the animal information we need to collect is rare and precious. We do not want to miss any information or waste the extra energy consumed by false triggers. YOLO V4 Tiny does not perform the SPP operation after feature extraction. In our network, SPP is applied at the highest dimension of feature extraction. This operation effectively realizes feature information reuse and fusion, but it also increases the inference time of the model. If other scenes have higher real-time requirements, we can simplify our feature extraction network to shorten the inference time of the model.
When constructing our detection network OurNet, we used MobileNet V2 as the backbone. In Table 2, we found that the detection time of the optimized network OurNet is shorter than that of MobileNet V2 but slightly longer than that of MobileNet V1. This is because the MobileNet V1 network has only 28 layers, 13 of which use depthwise separable convolution, whereas the MobileNet V2 network has 54 layers, 17 of which use depthwise separable convolution, and networks with greater depth require more time for model inference. At the same time, we simplified V2 by using the 512- and 256-dimensional feature information as output and abandoning the 1024-dimensional convolution to reduce parameters.
Therefore, the accuracy and the inference time of the model need to be weighed differently in different scenarios to achieve better results. In the wild animal information collection scene, a detection time of 5 ms already meets the requirements, so we used a higher-precision backbone network to achieve better performance.
As for the reduction of energy consumption, we mainly introduced the DNN to collect the information of specified animals and reduce the energy consumption caused by false triggers. Meanwhile, the network programmability of SDN also reduces energy consumption in terms of node activation. As shown in Table 1 and Figure 9 of the Results section, the self-organizing network of nodes can greatly reduce the extra energy consumption caused by the activation of redundant nodes and thus lower the total network energy consumption, especially when the coverage is small. In real wild environments, the placement range of infrared cameras is relatively small in order to facilitate battery replacement and access to the information on the storage card. Thus, the SDN-assisted system is well suited to wild animal information collection in future IoT networks.

Conclusions
In conclusion, considering detection accuracy, model inference time, and model parameters together, the traditional neural network structure cannot be well adapted to the actual application scenario of wild animal information collection. Through the fusion algorithm, the object detection network we propose reduces the inference time of the model and the number of parameters while ensuring detection accuracy. In this paper, we focus on the construction of a lightweight target detection network suitable for migration to sensor nodes. We also introduce SDN into the traditional information collection scenario and briefly analyse its advantages in reducing energy consumption. The SDN-assisted network effectively avoids redundant node energy consumption and achieves a self-organizing node network. Thus, the proposed scheme can effectively collect wild animal information in software-defined IoT networks with limited resources. On the other hand, since the network structure we propose, which integrates multiple modules, is based on the actual application scenario of wild animal information collection, the generalization ability of the model in other application scenarios cannot be guaranteed. In future work, we will continue to closely integrate other artificial intelligence technologies, such as reinforcement learning, with embedded devices and IoT systems to achieve more efficient information collection.

Data Availability Statement:
The data set we used is public and is called Oregon Wildlife. It contains 14,013 images of 20 kinds of animals. The data are widely used by wildlife researchers and can be used for classification or object detection tasks. We selected part of the data and relabelled it to make it better suited to the task of object detection. The corresponding labelled data set can be obtained by contacting the author (6112118008@email.ncu.edu.cn).

Conflicts of Interest:
The authors declare no conflict of interest.