A System for the Detection of Persons in Intelligent Buildings Using Camera Systems—A Comparative Study

This article deals with the design and implementation of a prototype of an efficient Low-Cost, Low-Power, Low-Complexity (hereinafter L-CPC) image recognition system for person detection. The methods developed and presented for image processing, analysis and recognition are designed specifically for embedded devices (e.g., motion sensors, identification of property and other specific applications) that comply with the requirements of intelligent building technologies. The paper describes detection methods using a static background, where the background image being compared does not change during the search for people, and a dynamic background, where the background image is continually adjusted or complemented by objects merging into the background. The results are compared with the output of the Horn-Schunck algorithm, which is based on the principle of optical flow. The candidate objects detected are subsequently stored and evaluated by the algorithm described. The detection results of the change detection methods are then evaluated using the Saaty method in order to determine the most successful configuration of the entire detection system. Each of the configurations was also tested on a video sequence divided into a total of 12 story sections, in which the normal activities of people inside an intelligent building were simulated.


Introduction
Camera systems are increasingly found in building interiors, and not only in intelligent buildings. These systems are most often used to obtain a video signal for recording or for online streaming to the control room of a security service, that is, to protect people or property, which is the subject of References [1,2]. However, after certain processing, such a video signal can also be used for the detection or identification of persons present in the scene being captured [3][4][5][6][7][8][9][10][11].
This article deals with the detection of people in a captured scene of the interior space of a building. The aim of this study is to create our own laboratory solution for the detection of persons in a room, with subsequent analysis of the data applicable to an intelligent building control system [12][13][14][15][16][17]. Other approaches to person detection apply principles different from detection from an image signal. For example, References [18,19] discuss modern methods of 3D detection of persons using the N2-3DDV-Hop method. These and other methods described in References [18,20] offer position information in 3-dimensional space. In contrast, image detection focuses on information from the projection of space into a 2D image matrix. In Reference [21], a similar problem is addressed using real hardware. However, image processing is handled differently here. The solution described in this article is intended for implementation on embedded hardware. Detection should work without complex algorithms, such as convolutional methods [22,23] including neural networks [24][25][26], neural classifiers for detecting persons as in Reference [27], or detection of persons using the Haar-cascade classifier as in Reference [28], so that the subsequent implementation on embedded devices [29][30][31] is as simple as possible.
Embedded devices for detecting persons can be based on image sensors, as described in References [32][33][34], in the form of a smart camera sensor that preprocesses image data into binary images with white dots indicating the position of a person in the scene, a smart camera sensor that separates the background of the scene from the foreground, which gives the position of the person found, as in Reference [35], a smart camera sensor that neglects the dynamic background as in Reference [36], a smart camera sensor for Histogram of Oriented Gradients (HOG) image data processing as in Reference [37], a specialized System on Chip (SOC) solution coping with basic image processing tasks such as edge detection in References [38,39], an edge detector [40], or a low-power smart CMOS image sensor used to detect persons indoors and outdoors as in Reference [41]. Other image processing solutions, such as edge detection using digital parallel pulse computation [42] or a non-parallel Sobel edge detector implemented in a smart camera sensor [43], serve as sources of information on a wide range of possible alternative solutions.
In the solution presented, finding persons consists of finding a suspicious object in the scene, checking that the object is of a sufficient size, and checking whether it may be an already existing finding. This is followed by the evaluation and output of the information obtained for possible further processing, for example by the intelligent building control system [44][45][46]. Detection and localization are described in References [47,48], which are texts on camera systems for finding objects in an image matrix; References [49][50][51] discuss methods of locating objects or persons in the image. References [48,52] were used as sources on tracking, and text [2] was used as a source of information on tracking using calibration methods and inverse kinematics methods as another direction in dealing with the issue of tracking.
The system devised finds suspicious objects or potential persons by subtracting two frames from each other, for example, the current frame and the previously processed frame. Methods creating a static or dynamic background, which replaces the previous frame used in the basic model of the system, were also tested. References [38,42,53] were used as sources for work with a dynamic background and the subsequent subtraction of the area of interest from the background; they address problems similar to those of this work, but in a different way. The binarized difference in brightness levels of these frames is subsequently subjected to a check of the size of the findings. The resulting difference depends on the frame rate (fps) used and on the sensitivity of the thresholding algorithm. Another solution applied is the Horn-Schunck method of optical flow [54,55], whose output is, like the output of the difference method, binarized and further processed by decision algorithms, see Reference [56]. The detection, localization and tracking solutions described in this text use only a camera sensor, in contrast to other methods using a laser scanner, ultrasonic sensors or PIR sensors, as in References [1][57][58][59].
Persons are detected based on the assumed dimensions of their silhouettes. Since a person's silhouette can take different shapes and dimensions in different camera shots (depending on the optical distortion and the location of the lens in the room), the parameters of this detection filter must be set for a particular scene. References [40,60,61] discuss, in more detail, the basic methods of image processing for systems with low energy demands. A comparison of available camera sensors for low-power systems is provided in Reference [62], but another source of image data was used in this work: a scene from a ceiling camera (a component of an ASUS ZenFone 2 ZE551ML mobile phone) located in the middle of the room and fitted with an additional 170° camera lens. In this solution, an unpreprocessed video signal is used, that is, a signal similar to the output of the specialized circuits mentioned in References [63,64]. The background detection algorithms, the detection of suspicious findings and the algorithm of the final process of distinguishing the persons detected from other dynamic objects are described in the following section.
The methods proposed in the final version of the solution are designed for low-power systems, ideally based on System on a Chip (SOC) technology, for use in intelligent building applications. Such hardware is more suitable for this task than the high-power PC graphics cards used in Reference [65]. Solutions using special chips for basic image processing tasks are provided, for example, in Reference [66], where such a system can be used for data preprocessing in a more complex system. The primary benefit of the study is the testing of the individual detection methods on a test video sequence that includes possible interfering light effects in the room. The solution proposed can be implemented on embedded devices.
This concept deals with the detection of persons from originally untransformed image data of a commonly used RGB camera. The paper focuses on research into a Low-Cost, Low-Power, Low-Complexity (L-CPC) image recognition system, which limits the choice of colour model. The RGB colour model was chosen for the comparison of algorithms and methods because it is the common output signal of CCD and CMOS image sensors. Transformation to the HSV, HSB or HSL colour models requires additional calculations [67][68][69]. These models could give better recognition results, but they are not applicable to an L-CPC image recognition system, which requires the highest possible processing speed and the lowest computational load [70,71]. A thermal infrared camera could also be used in conjunction with the aforementioned methods. The error factor introduced by brightness deviations (caused, for example, by sharp sunlight entering the room through the window) would then be replaced by thermal deviations in the scanned part of the scene.
Section 2 describes differential methods for detecting persons from the input video signal. It describes our three detection methods computing the difference towards a static background, three methods computing the difference towards a dynamic background, and a method using the principle of optical flow.
Section 3 describes the applied methods of adjusting the detected findings and the method of their filtering, identification, and description. Section 4 describes the methods of evaluating the quality of detection of persons of each of the described methods. The evaluated sections of the test video used are described here, followed by the evaluation of the detection quality.
Section 5 describes the overall evaluation of the work and the results contained therein.

Methods
The methods developed and implemented in this article are distinguished by their approach to detecting the image background of the original image sequence. This is a not quite traditional approach, in which the recognition algorithms differ only in the way the background is detected. The authors chose this approach mainly on the basis of their own experience and of selected literature [54,55,72,73]. Other works usually describe the procedure for background recognition, but do not treat it as a fundamental problem or as the core of the research. From our point of view, the variability of background detection methods is crucial for recognizing human movement. During the research, we analysed several background detection algorithms inspired by the available literature. By adapting the methods to the detection problem presented, we were able to design our own original solution for each of the algorithms. The individual methods designed and analysed are described in detail below and evaluated with respect to object recognition accuracy and success rate.
The methods of differentiating the current frame from the background can be based on either a static background or a dynamic background, see References [72,74]. Although both approaches allow a plethora of variations in how the background is created and applied in the differential search for suspicious objects, this work presents a total of six custom solutions to this problem: three that find the difference towards a static background, where each test frame is compared to a fixed frame, and three that find the difference towards a dynamic background, where each current test frame is compared to a background frame dynamically adjusted based on n-previous frames.

• The method of differentiating the current frame towards the static background, see Reference [72].
• The method of differentiating the current frame towards the dynamic background, see Reference [72].
• The method of optical flow, see Reference [55].
The changes detected in the frame are appropriately thresholded and then passed to other parts of the program as a binary map of the objects found, as described in the activity diagram in Figure 1. An appropriately selected threshold filters out low-contrast changes from the output image. Such changes may represent slight added shadows; large changes in brightness may, however, also be perceived as undesirable findings. To increase the robustness of the filtration, it would be advantageous to use two threshold levels, that is, multi-level thresholding. However, such filtration was not used in the described solution in order to demonstrate the behaviour of the described detection methods.
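The thresholding step can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `binarize_difference`, the single fixed threshold and the use of nested Python lists as greyscale frames are hypothetical choices, not the implementation used in this work.

```python
def binarize_difference(frame, background, threshold):
    """Return a binary map: 1 where the absolute brightness difference
    between the frame and the background exceeds the threshold, else 0."""
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Tiny 2x3 greyscale example: a bright object appears in the middle column.
background = [[10, 10, 10], [10, 10, 10]]
frame = [[12, 200, 10], [10, 180, 9]]
print(binarize_difference(frame, background, threshold=30))
# [[0, 1, 0], [0, 1, 0]]
```

Small deviations such as the change from 10 to 12 fall below the threshold and are filtered out, which is the behaviour described for low-contrast changes.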
The selected methods are structured according to the way they work with the background image data of the detected persons. The individual versions were set up for laboratory testing on the test video sequence so that each version achieved the highest possible success rate in detecting people. The purpose of comparing methods that are very susceptible to sudden changes in the image with methods that are immune to them is to point out their possible applicability in various applications. Due to the complexity of the analysis of object recognition methods, we decided to introduce as many versions and types of methods as possible. These allow embedded systems with limited computing capabilities to recognize objects in the image. Only a complete analysis of the different method versions can give a comprehensive answer as to which algorithms are the most suitable.

The Method of Differentiating the Current Frame towards the Static Background
For clarity purposes, the method is hereinafter referred to as Method 1. This method uses the differences between the current frame and the background frame. A static background is considered to be a background fixed for the duration of the program run or for a period otherwise specified. Typically, such a background is compared to more than one frame of the scene. The principle of such a method is illustrated in the diagram in Figure 2. The advantage of this method is a considerable saving of memory space: once the static background is finally found, only one frame of the final background needs to be kept.
In the first variant (Method 1.1), the background is usually given by the first frame of the video sequence. The frame intended for the background should be based on a static scene, ideally a scene devoid of dynamically moving objects at the time of taking the picture. This condition for capturing the background image ensures a minimum error rate of the subsequent detection of dynamically moving objects in future frames of the video sequence. The principle of the method is illustrated in the diagram in Figure 3. However, change detection against a background found in this way is very susceptible to changes in the lighting conditions between the frame being compared and this background. An example of using this method is shown in the series of frames in Figure 4.
In the second variant (Method 1.2), the background is based on the gradual weighted averaging of the first n frames. Frame averaging minimizes the effect of changing lighting conditions in the scene, and thereby the effect of unwanted changes in pixel brightness in the frames analyzed during the subsequent detection. Another purpose of this method of background creation is to suppress possible dynamically moving objects in the scene at the beginning of the video sequence being captured. An example of using this method is shown in the series of frames in Figure 5.
The background can be practically created even when dynamically moving objects are present in the scene. However, such objects can introduce significant errors in the subsequent results of the detection system.
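The weighted averaging of the first n frames can be sketched as follows. This is a minimal sketch, not the code of this work: the update rule B = (w·B + F)/(w + 1), the function name `average_background` and the use of nested Python lists as greyscale frames are assumptions made for illustration. The defaults of 5 for the background weight and 10 for the number of frames follow the settings reported for the system.

```python
def average_background(frames, background_weight=5, n_frames=10):
    """Build a static background from the first n_frames frames.

    Assumed update rule: each new frame is blended into the background
    with weight 1 against background_weight for the existing background.
    """
    frames = frames[:n_frames]
    # The first frame initializes the background.
    background = [[float(p) for p in row] for row in frames[0]]
    for frame in frames[1:]:
        background = [
            [(background_weight * b + f) / (background_weight + 1)
             for b, f in zip(bg_row, frame_row)]
            for bg_row, frame_row in zip(background, frame)
        ]
    return background


# A pixel that is 0 in the first frame and 60 in the second one
# moves only one sixth of the way towards the new value.
print(average_background([[[0]], [[60]]]))  # [[10.0]]
```

The example shows why a briefly visible moving object leaves only a weak trace in the background, while it also shows that the trace does not vanish completely.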
In the system code applied according to Figure 6, a background weight of 5 is used. The final background is produced from the first 10 frames of the video sequence.
In the third variant (Method 1.3), the background is created on the assumption that there may be dynamically moving persons or other objects in the scene while the frames for the static background are being taken. The principle of the method is to gradually find fixed regions between successive frames, which are gradually put together into the final background map. This algorithm is described in the diagram in Figure 7. In an ideal scene that is completely unchanged during the first two frames of the video sequence over the entire area of the frame checked, the final background will consist of pixels of the same brightness levels as the original frames. Since such ideal conditions cannot be achieved in practice, there are differing regions between every two frames, ranging from tens of pixels per frame to almost the entire area of the frames compared (such an error rate may be caused either by the sensing device evaluating the lighting conditions in the scene or by changes in the lighting conditions in the scene itself). Because static-looking regions are added to the background map, the image of a static-looking dynamic object, for example a person sitting still in the room, may be introduced into the final background as an error of the background creation method.
To make the background algorithm applicable in non-ideal conditions as well, an adjustable tolerance of the match of grey-level brightness was added for deciding whether the pixel brightness values of the frames compared match. An example of using this method is shown in the series of frames in Figure 8.
Since, in the initial state, the background frame has a value of 255 in each pixel, that is, a brightness level corresponding to white, indeterminate regions or regions with a constant change between frames retain a brightness level of exactly 255 in the final background frame. In the subsequent binarization by thresholding, such regions will then be evaluated as findings of objects, although they are erroneous.
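The composition of a background from regions that stay fixed between successive frames can be sketched as below. The sketch rests on several assumptions that the text does not confirm: a pixel is taken over from the newer frame of a pair, it is written only once (tracked by a separate `filled` mask), and the names `compose_static_background` and `tolerance` are hypothetical.

```python
def compose_static_background(frames, tolerance=3):
    """Compose a background from pixels that stay stable between
    successive frames. Undetermined pixels keep the initial value 255."""
    height, width = len(frames[0]), len(frames[0][0])
    background = [[255] * width for _ in range(height)]
    filled = [[False] * width for _ in range(height)]
    # Compare every pair of successive frames pixel by pixel.
    for previous, current in zip(frames, frames[1:]):
        for y in range(height):
            for x in range(width):
                stable = abs(current[y][x] - previous[y][x]) <= tolerance
                if stable and not filled[y][x]:
                    background[y][x] = current[y][x]
                    filled[y][x] = True
    return background


# The left pixel is stable; the right pixel changes in every frame
# and therefore keeps the initial white value of 255.
frames = [[[50, 10]], [[51, 90]], [[50, 200]]]
print(compose_static_background(frames))  # [[51, 255]]
```

The right pixel illustrates the error described above: a region with a constant change between frames stays at 255 and would later be reported as a finding.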

The Method of Differentiating the Current Frame towards the Dynamic Background
For the sake of clarity, the method is referred to as Method 2. A background that is not compared twice to any current frame (i.e., the background is adjusted, if not changed completely, for each frame checked) is considered to be a dynamic background or, also, a variable background.
These sections describe the individual methods for the purpose of their mutual comparison. Comparing the described assumptions with the behaviour of the implemented methods in the real experiment yields a set of image processing options for the detection of persons by low-performance embedded camera systems used in intelligent buildings. Different methods are applied to create the dynamic background, because comparing them is very important for finding the most suitable representation of the background. The success rate of the recognition algorithm for person detection depends directly on the background definition. Based on these theoretical assumptions and our own research experience, we decided to analyse three methods that give the most dissimilar results for background definition. The background detection methods chosen are the frame previous to the current frame, the average of all previous frames, and the average of n-previous frames. Each method brings advantages and disadvantages, which are described in the text.
The difference of the current frame towards the background is calculated only after the initial background is created. A new background is created only after the current background has been used by the image change detection algorithm (in the first pass of the differential part of the program, the current background is the initial background), that is, after the part of the program that uses the current background has finished, to prevent image data collision. The new background is either a brand new frame or the existing background frame modified according to the background creation method used. Such a new background is an image function of the old background and the new frame. This algorithm is also described in the diagram in Figure 9.
Within this difference method of frame change detection, three methods are applied to create a dynamic background, where the background is the frame previous to the current frame (Method 2.1), the average of all previous frames (Method 2.2), or the average of n-previous frames (Method 2.3).
The first of these methods (Method 2.1) always uses the previous frame as the background for the current frame. Frame change detection based on the difference towards such a background is focused primarily on dynamically moving objects, namely objects showing a constant change between frames. The algorithm is also described by the diagrams in Figure 10. From the perspective of the theoretical assumptions of the change detection method, the detection of an ideally constantly moving object, which can be the person being detected, is exceptionally reliable. A great advantage of such change detection is its high resistance to changes in the lighting conditions in the scene between frames of the video sequence. The disadvantage, which follows from the principle of the method, is the increase in the error rate when the person (object) in the scene stops moving. Such an object then looks like a static object among the frames being processed and cannot be found by the detection algorithm as an object moving dynamically in the entire video sequence. The behaviour of this method is shown graphically in the series of frames in Figure 11. In the binarized difference in brightness between the currently checked frame and the background frame (Figure 11d), findings that reach a value of logical 1 (white in the frame field) are considered potential findings of a detected person.
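The behaviour of the difference towards the previous frame, including the loss of an object that stops moving, can be reproduced with a few lines. The function name `previous_frame_detection` and the representation of frames as nested Python lists are assumptions made for this sketch.

```python
def previous_frame_detection(frames, threshold):
    """Return one binary change map per frame (from the second frame on),
    using the previous frame as the background of the current frame."""
    maps = []
    previous = None
    for frame in frames:
        if previous is not None:
            maps.append([
                [1 if abs(c - p) > threshold else 0
                 for c, p in zip(cur_row, prev_row)]
                for cur_row, prev_row in zip(frame, previous)
            ])
        previous = frame  # the current frame becomes the next background
    return maps


# A bright object moves one pixel to the right and then stops.
frames = [
    [[200, 0, 0, 0]],
    [[0, 200, 0, 0]],
    [[0, 200, 0, 0]],
]
print(previous_frame_detection(frames, threshold=30))
# [[[1, 1, 0, 0]], [[0, 0, 0, 0]]]
```

The first map marks both the old and the new position of the object; the second map is empty because the object did not move between the last two frames.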

Average of All Previous Frames (Method 2.2)
The method of creating a dynamic background as the arithmetic mean of all the previous frames is a dynamic analogy to the above-mentioned method of creating a static background. The method is described by the diagrams in Figure 12. The background is based on the gradual weighted averaging of all the frames of the video sequence. Frame averaging minimizes the effect of changing lighting conditions in the scene, and thereby the effect of unwanted changes in pixel brightness in the current frames during the subsequent detection.
Another purpose of this method of background creation is to suppress possible dynamically moving objects in the scene by giving the existing background a weight greater than 1 when creating a new background. Preferring the information from the previous background suppresses the new frame and, with it, the changes in the scene that the new frame contains. Therefore, if a change in the scene is constant, it becomes less important with each newly created background in the subsequent difference of the current frame towards the background. However, the difference towards such a background is not resistant to significant sudden changes in lighting conditions in the scene. It is evident from the principle of the method that the change in the lighting conditions in the scene must be as small as possible in order to minimize the errors of the change detection method. The weight of the previous background frame relative to the current frame is a particularly important factor. The greater the weight of the previous background frame, the more negligible the errors of the newly added current frame will be in the new background. The aim of this method is to create an idealized, continually restored background in which the error noise caused by dynamically moving objects in the scene of the original frame is eliminated.
An example of using this method is illustrated in the series of frames in Figure 13. As noted, the differential detection of changes in the current frame against such a background is not resistant to sudden changes in the lighting conditions in the scene. Such a sudden change can only be evaluated as an error of the method, which shows as a high error rate across the entire frame.
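The effect of the background weight can be made explicit under one assumed form of the update. Suppose, as an illustration only (the exact rule is not given in the text), that each new background $B_k$ is formed from the previous background $B_{k-1}$ with weight $w$ and from the current frame $F_k$ with weight 1:

```latex
B_k = \frac{w\,B_{k-1} + F_k}{w + 1}
\quad\Longrightarrow\quad
B_k = \left(\frac{w}{w+1}\right)^{k} B_0
      + \frac{1}{w+1}\sum_{i=1}^{k}\left(\frac{w}{w+1}\right)^{k-i} F_i
```

Under this assumption, a frame enters the background with a share of $1/(w+1)$, and that share shrinks by the factor $w/(w+1)$ with every later update. For $w = 5$, for instance, a single frame contributes one sixth when it is added, and about 16% of that initial share remains after ten further updates. A larger $w$ therefore suppresses moving objects more strongly, but it also makes the background slower to follow a lasting change in lighting.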

Average of n-Previous Frames (Method 2.3)
As with the difference towards the dynamic background created by averaging all the previous frames, the functionality of the method using a background created by averaging n-previous frames depends on the setting of the weighting parameter of the previous frame. In addition to this adjustable parameter, it is also necessary to determine the number of frames that form the dynamic background of the current frame when the difference in the image is detected. The weight parameter of the previous background frame, applied when creating a new background, affects the gradual disappearance of images of dynamically moving objects or persons in the scene, with the aim of achieving the most accurate background for the subsequent differential detection of changes in the current frame towards the background frame. The method is described by the diagrams in Figure 14, describing the creation of the initial background, and Figure 15, describing the creation of each new background. This method is designed to eliminate the effects of images of dynamic objects in the scene of the current frame involved in creating a new background frame. In contrast to the aforementioned method of averaging all the previous frames, only a limited number of previous frames are used for the dynamic background of the current frame. This prevents past errors from being brought into the current background. The weighting calculation is also set so that the oldest background frame takes the highest weight, thus bringing the highest error rate to the resulting frame. In this way, the erroneous detection of dynamically moving objects in the scene, whose position is expected to change in the scene of the current frame, is limited. Figure 16 shows an example of the output of this method. In the binarized difference in brightness between the currently checked frame and the background frame (Figure 16d), findings that reach a value of logical 1 (white in the frame field) are considered potential findings of a detected person.
Of the above-mentioned methods, this method is the most memory-intensive, because exactly n frames of the temporary background must be preserved for the background created from n-previous frames (background frames are gradually completed according to the algorithms described above).
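The memory requirement and the weighting of the n-previous frames can be sketched with a fixed-length buffer. This is a sketch under assumptions: the class name `SlidingBackground` and the linearly decreasing weights (n for the oldest frame down to 1 for the newest) are hypothetical; the text states only that the oldest frame takes the highest weight.

```python
from collections import deque


class SlidingBackground:
    """Dynamic background built from the last n_frames frames."""

    def __init__(self, n_frames):
        # Exactly n_frames frames are kept; older ones are dropped.
        self.buffer = deque(maxlen=n_frames)

    def update(self, frame):
        self.buffer.append(frame)

    def background(self):
        count = len(self.buffer)
        # Assumed weighting: oldest frame first, with the highest weight.
        weights = list(range(count, 0, -1))
        total = sum(weights)
        height, width = len(self.buffer[0]), len(self.buffer[0][0])
        return [
            [sum(w * f[y][x] for w, f in zip(weights, self.buffer)) / total
             for x in range(width)]
            for y in range(height)
        ]


model = SlidingBackground(n_frames=2)
for frame in ([[0]], [[30]], [[60]]):
    model.update(frame)
# Only the last two frames remain: (2 * 30 + 1 * 60) / 3
print(model.background())  # [[40.0]]
```

The first frame no longer influences the result, which illustrates how limiting the number of frames keeps past errors out of the current background.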

The Method of Optical Flow (Method 3)
For the sake of clarity, the method is referred to as Method 3. This is a method developed to detect dynamically moving objects in a scene. Using the optical flow method and the resulting brightness function calculation, the program calculates a velocity vector that reflects the direction and magnitude of the velocity of motion in the image. Of this velocity vector, however, only the magnitude at each point of the image examined is taken for further processing. This information on potentially moving objects is then thresholded in order to obtain binary images of the findings. The application of the optical flow method is described in the diagram in Figure 17. The method is very sensitive to changes in the lighting conditions in the scene of the input frames. In contrast to the difference methods, which work with greyscale input frames, this method finds the change between the input frames, that is, the current frame and the background frame, in the original RGB colour model. Since third-party code is used, the algorithm of this method is not described in detail in this work. The Horn-Schunck optical flow solution algorithm is available on the MathWorks MATLAB support site [54,55,73]. An example of applying the method to a moving object detection algorithm is shown in the series of frames in Figure 18.
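The post-processing of the optical flow output, that is, taking the magnitude of the velocity vector and thresholding it, can be sketched as follows. The flow computation itself is not reproduced here; the inputs `u` and `v` (horizontal and vertical velocity components per pixel) and the function name `flow_to_binary` are assumptions for illustration.

```python
import math


def flow_to_binary(u, v, threshold):
    """Threshold the magnitude of a per-pixel velocity field (u, v)
    into a binary map of potentially moving objects."""
    return [
        [1 if math.hypot(du, dv) > threshold else 0
         for du, dv in zip(u_row, v_row)]
        for u_row, v_row in zip(u, v)
    ]


# One fast-moving pixel (magnitude 5) and one almost static pixel.
u = [[3.0, 0.1]]
v = [[4.0, 0.1]]
print(flow_to_binary(u, v, threshold=1.0))  # [[1, 0]]
```

The direction of motion is discarded in this step; only the speed decides whether a pixel becomes a finding.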

Adjustment of the Objects Found
This part of the block adjusts the dimensions and integrity of the objects found. The adjustment is achieved by means of the following steps:
1. Filter erosion of the objects found to remove image noise findings.
2. Dilatation of the objects found.
3. Erosion of the objects found.
Each of these steps has its purpose in the final shape and size of the object found. Filtering out the findings caused by image noise using erosion minimizes the potential for increasing error rates in the subsequent dilatation of the objects found. The work uses erosion by three pixels for this purpose, as shown in Figure 19, that is, each of the objects found is reduced line by line, always by one pixel of a positive finding. During dilatation, each object finding is extended by a user-adjustable factor of the region around the particular finding point. This step is used to combine discontinuous object findings into an undivided object. The effect of dilatation on the frame in Figure 19b is shown in Figure 20a. The dilatation region factor is, however, set differently, as needed, for the resulting binary image of each of the frame change detection methods used. For example, when using change detection by means of optical flow, a significant expansion of the findings is necessary, since findings belonging to the same person in the scene of the current frame may occur separately from one another.
If added shadows occur, the success rate of the selected methods in correctly detecting persons in the image is often reduced. In the case of the differential methods (Method 1.1, Method 1.2, Method 1.3, Method 2.1, Method 2.2 and Method 2.3), a shadow could be perceived as a newly detected person. To decide whether a change in the image is sufficient to be evaluated as a new person, it is thus crucial to set a threshold level for filtering the input data. The behaviour of the described methods in the case of excessive shadows is explained in Section 2.
In addition, the practical implementation envisages the addition of an algorithm comprising definition of areas where newcomers can appear and where they can leave the room. The article no longer deals with this functional superstructure, but it will significantly contribute to the elimination of the above-mentioned phenomenon.
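The erosion and dilatation steps described in this section can be sketched on a binary map as follows. The sketch assumes a square neighbourhood of adjustable radius and treats pixels outside the image as background; both choices, as well as the names `erode` and `dilate`, are assumptions for illustration and may differ from the structuring element used in this work.

```python
def _window(image, y, x, radius):
    """Yield the values in the square neighbourhood of (y, x);
    positions outside the image count as background (0)."""
    height, width = len(image), len(image[0])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            inside = 0 <= yy < height and 0 <= xx < width
            yield image[yy][xx] if inside else 0


def erode(image, radius=1):
    """Keep a pixel only if its whole neighbourhood is positive."""
    return [[1 if all(_window(image, y, x, radius)) else 0
             for x in range(len(image[0]))] for y in range(len(image))]


def dilate(image, radius=1):
    """Set a pixel if any pixel in its neighbourhood is positive."""
    return [[1 if any(_window(image, y, x, radius)) else 0
             for x in range(len(image[0]))] for y in range(len(image))]


noisy = [
    [0, 0, 0, 0, 0, 0, 1],  # isolated noise pixel in the corner
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = dilate(erode(noisy))
for row in cleaned:
    print(row)
```

In this example the erosion removes the isolated noise pixel and shrinks the 3 x 3 object to its centre, and the following dilatation restores the object to its original extent. A larger dilatation radius would additionally join nearby fragments into one object.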

Identification of Object Findings
The binary image of the objects adjusted is then subjected to the identification of the individual object findings.
In the first stage of identification, the positive pixels (belonging to a potential person) of each cluster that is continuous according to the 4-neighbour rule are grouped into regions, each marked by its own unique identifier.
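Grouping by the 4-neighbour rule can be illustrated by a simple flood fill. The function name `label_regions` and the breadth-first traversal are choices made for this sketch and are not taken from the implementation described here.

```python
from collections import deque


def label_regions(binary):
    """Label 4-connected clusters of positive pixels.
    Returns the label map (0 = background) and the number of regions."""
    height, width = len(binary), len(binary[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if binary[y][x] and not labels[y][x]:
                count += 1  # new region found, assign the next identifier
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    # Visit the upper, lower, left and right neighbours only.
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


binary = [
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
print(label_regions(binary))
# ([[1, 1, 0], [0, 0, 2], [0, 0, 2]], 2)
```

The two clusters touch only diagonally, so under the 4-neighbour rule they receive different identifiers.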
The next stage of identification is the marking of the objects found by delimiting regions and the possible joining of these regions into larger units. The identified object regions are passed on as a vector description of rectangular areas, that is, a description given by two pixels in the image. The first, initial, point is given by the smallest horizontal pixel coordinate and the smallest vertical pixel coordinate of the object described. The second point is given by the largest horizontal pixel coordinate and the largest vertical pixel coordinate of the object described. The principle of marking the object found is explained in the illustration in Figure 21. The objects delimited by rectangular areas in this way are then checked for intersection with other delimited regions. If two regions intersect, they are unified into a common region of the object, as shown in the illustration in Figure 22.
The information about the objects found in the frame currently checked is stored in a two-dimensional description field for further analysis. After the objects have been checked and unified into marking regions, the image itself is no longer used by the algorithm, as the other parts of the algorithm work only with the rectangular area marking vectors.

Figure 22. Illustrations describing the unification of the intersecting objects found into the final object. Annotations in the figure: $x_C = (x_A + x_B) - d_x$, $y_C = (y_A + y_B) - d_y$.
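The unification of intersecting rectangular regions can be sketched as follows. The tuple layout `(x_min, y_min, x_max, y_max)`, the inclusive pixel coordinates and the function names are assumptions made for illustration; the article does not specify how the marking vectors are stored or in which order regions are compared.

```python
def intersects(a, b):
    """True if two rectangles (x_min, y_min, x_max, y_max) share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def unify(a, b):
    """Smallest rectangle that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_regions(boxes):
    """Repeatedly unify intersecting rectangles until none intersect."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if intersects(boxes[i], boxes[j]):
                    boxes[i] = unify(boxes[i], boxes[j])
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, the grown region may touch others
    return boxes


regions = [(0, 0, 4, 4), (3, 3, 8, 6), (20, 20, 25, 30), (7, 5, 10, 9)]
final_regions = merge_regions(regions)
```

The scan restarts after every unification because a grown region can intersect a rectangle that neither of its parts touched before; here the fourth rectangle only joins after the first two have been unified.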

Filtration of Error Findings from Findings of Persons
The object findings adjusted and identified are filtered according to the dimensions of their rectangular delimitation and the actual area of positive pixels in the marking field.
In the illustration in Figure 23, vectors representing regions that indicate object findings are plotted. Vectors $\vec{a}_1$ and $\vec{a}_2$ represent regions smaller than the minimum dimensions $x_{min}$ and $y_{min}$ specified by the condition for filtering object findings from findings of persons. Vectors $\vec{b}_1$ and $\vec{b}_2$ represent regions with one dimension less than the minimum limit $x_{min}$ or $y_{min}$. Vectors $\vec{d}_1$ and $\vec{d}_2$ represent regions with at least one dimension greater than the maximum limit $x_{max}$ or $y_{max}$. Thus, the findings delimited by the regions represented by vectors $\vec{a}_1$, $\vec{a}_2$, $\vec{b}_1$, $\vec{b}_2$, $\vec{d}_1$ and $\vec{d}_2$ do not meet the dimension filter criteria and are, therefore, not further processed as findings of persons. In contrast, vectors $\vec{c}_1$ and $\vec{c}_2$ represent regions with both dimensions greater than the minimum dimensions $x_{min}$ and $y_{min}$ and smaller than the maximum dimensions $x_{max}$ and $y_{max}$. Such object findings are further analyzed as findings of persons.
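The dimension filter can be expressed as a single predicate. The sketch below is illustrative only: the numeric limits, the `area_min` threshold for the positive pixel area, and the function name are hypothetical, since the article does not give the values used.

```python
def passes_size_filter(box, pixel_area, limits):
    """Keep a finding only if both dimensions lie strictly between the limits.

    box is (x_min, y_min, x_max, y_max) with inclusive pixel coordinates;
    pixel_area is the number of positive pixels inside the region.
    """
    width = box[2] - box[0] + 1
    height = box[3] - box[1] + 1
    return (
        limits["x_min"] < width < limits["x_max"]
        and limits["y_min"] < height < limits["y_max"]
        and pixel_area >= limits["area_min"]
    )


# Hypothetical limits in pixels.
limits = {"x_min": 10, "y_min": 20, "x_max": 60, "y_max": 120, "area_min": 100}

# One example region per vector family described above: (box, positive pixel area).
findings = {
    "a": ((0, 0, 4, 9), 40),      # both dimensions too small
    "b": ((0, 0, 29, 9), 250),    # one dimension too small
    "c": ((0, 0, 29, 59), 900),   # within all limits
    "d": ((0, 0, 79, 59), 3000),  # one dimension too large
}
kept = [name for name, (box, area) in findings.items()
        if passes_size_filter(box, area, limits)]
```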

Cataloguing of Persons Found
In the cataloguing block, the persons found, represented by a description of their location and size, are identified and subsequently monitored to see whether these findings are still current in the scene. It is in this part of the system for the detection of persons in intelligent buildings that the findings of persons from current frames are set in context with the findings of persons from the previous frames. The cataloguing block works with a two-dimensional array of information, in which each current finding is assigned one row. The array carries information about the persons currently found, as well as error findings in the scene, and each of the findings is assigned a lifespan value within this block.
Each newly found or currently restored person finding is assigned an index called the lifespan of the person finding, which is set to the maximum user-selectable value, the "maximum lifespan of a person finding", as shown in the frame in Figure 24a. On each pass of the program loop, the lifespan value of each person found is decremented, as shown in the frame in Figure 24b. If a person finding in the array of current findings is not restored while its lifespan is non-zero and its lifespan reaches zero, it is removed from the array as an outdated or error finding.

As part of its identification, each new person finding is compared with the person findings listed in the array of findings currently present in the scene. Conformity of the person findings is evaluated by detecting intersections of the rectangular areas delimiting the findings being compared. The effect of detecting the intersections of the areas found is shown in the frames in Figure 25. If an intersection is found, the existing finding is overwritten by the new finding and its lifespan is reset to the maximum lifespan of a person finding.
The example below illustrates the structure of the output data of the final system for the two objects in Figure 24. In Figure 24a, both findings have a 100% lifespan; in Figure 24b, one of the objects (potential persons) has a 100% lifespan, whereas the lifespan of the other has already been reduced. Table 1 provides an example of all current person findings in the frame in Figure 24b, including their current lifespan. Since the maximum (100%) lifespan was set to 4 frames when producing these examples, the most current person finding, mentioned above in the tables of the initial findings and unified person findings, is correctly identified by the lifespan index in the first row. The next finding, with a lifespan index of 3, was last updated during the previous program pass, that is, in the previous input frame. The table of output information transferred to the superior system also includes the number of pixels belonging to the person found and the location of the individual findings in the matrix of the image checked.
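The lifespan bookkeeping described in this section can be sketched as one update step per program pass. This is an assumed reading of the text, not the authors' code: the dictionary layout, the function names and the example rectangles are hypothetical, and the order "decrement first, then refresh matched findings" is chosen because it reproduces the lifespans 4 and 3 of the example with a maximum lifespan of 4.

```python
def intersects(a, b):
    """True if rectangles (x_min, y_min, x_max, y_max) share at least one pixel."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def update_catalogue(catalogue, detections, max_lifespan=4):
    """One program pass: age all findings, refresh matched ones, drop expired ones."""
    # Decrement the lifespan of every finding already in the catalogue.
    entries = [{"box": e["box"], "life": e["life"] - 1} for e in catalogue]
    for box in detections:
        for entry in entries:
            if intersects(entry["box"], box):
                # Existing finding restored: overwrite it and reset its lifespan.
                entry["box"] = box
                entry["life"] = max_lifespan
                break
        else:
            # No intersection with any known finding: catalogue a new person finding.
            entries.append({"box": box, "life": max_lifespan})
    # Findings whose lifespan reached zero are outdated or error findings.
    return [e for e in entries if e["life"] > 0]


person_a = (10, 10, 30, 60)
person_b = (100, 20, 120, 70)

catalogue = update_catalogue([], [person_a, person_b])       # both findings are new
catalogue = update_catalogue(catalogue, [(12, 10, 32, 60)])  # only person A is seen again
lives_after_second_frame = [e["life"] for e in catalogue]

for _ in range(3):                                           # three frames without findings
    catalogue = update_catalogue(catalogue, [])
```

After the second frame the restored finding holds the full lifespan and the other one has dropped by one; after three further frames without detections the second finding has expired and been removed, while the first is on its last frame.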

Results
For each of the methods used, the person detection system was applied to the same video sequence, which was designed to simulate the normal operating conditions of a camera in intelligent buildings. The test video was divided into 12 partial event sections, which were individually evaluated and statistically processed. Table 2 describes the activities of the persons in each test video event section. These sections are designed to capture the most common behaviour of persons in a scene being captured by the camera system of an intelligent building, including entrance, exit, movement of persons, occasional stopping of persons and shifting of objects in the scene.

Verification of the Results
In order to determine the correct function of the algorithms applied, it was necessary to create a verification template of the results. The verification results had to be generated for each of the frames tested so as to sufficiently assess the quality of the detection of persons. In the test video sequence, the verification of the results achieved was thus conducted on a frame-by-frame basis.
The final weighted detection quality assessment (K) is the sum of the individual weights of the quality criteria ($K_i$), each multiplied by the percentage to which the criterion is met, $p_{K_i}$. The weights of the quality criteria are determined by the Saaty multi-criteria evaluation method according to the source cited below.
The values of the criteria weights are determined using the geometric means of the rows of the Saaty matrix given in Table 3. Standardizing these row geometric means yields the standardized weights of the set of criteria being dealt with [15]:

$$v_i = \frac{G_i}{\sum_{j=1}^{n} G_j}$$

where $v_i$ is the standardized weight of the i-th criterion, $G_i$ is the geometric mean of the i-th criterion, and n is the total number of criteria. The work uses five quality criteria. Each of these criteria expresses the correctness of the person findings detected in the scene being captured. The values in the rows and columns of Table 3 are based on the importance of one criterion in relation to another. The diagonal contains values of 1 because each criterion is equivalent to itself (for example, criterion $K_1$ in the row is equivalent to criterion $K_1$ in the column). The highest value is 9; a criterion assigned the value of 9 is absolutely more significant than the criterion in the relevant column. The preferences between the individual criteria are determined expertly according to the required strictness of the final weighted detection quality assessment.
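The weight calculation from row geometric means can be illustrated with a short sketch. The 3×3 comparison matrix below is purely hypothetical (the work uses five criteria, and the values of Table 3 are not reproduced here), and the function name `saaty_weights` is an assumption.

```python
import math


def saaty_weights(matrix):
    """Standardized weights from the geometric means of the rows of a Saaty matrix."""
    n = len(matrix)
    geometric_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geometric_means)
    return [g / total for g in geometric_means]


# Hypothetical pairwise comparison matrix: entry [i][j] states how much more
# important criterion i is than criterion j; the diagonal is always 1.
comparison = [
    [1.0,     3.0,     9.0],
    [1 / 3.0, 1.0,     3.0],
    [1 / 9.0, 1 / 3.0, 1.0],
]
weights = saaty_weights(comparison)
```

For this matrix the row geometric means are 3, 1 and 1/3, so the standardized weights come out as 9/13, 3/13 and 1/13 and sum to 1.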
To obtain a cumulative assessment of the system detection quality, the system of weights must be combined with the information on whether or not each quality assessment criterion is met. The resulting weighted quality assessment is obtained by the following calculation:

$$K = \sum_{i=1}^{n} v_i \, p_{K_i}$$

where K is the weighted detection quality assessment, $v_i$ is the standardized weight of the i-th criterion, and $p_{K_i}$ is the percentage to which the i-th criterion is met. Because $p_{K_i}$ is a percentage and the standardized weights sum to 1, the resulting value of K lies in the range of 0 to 1. K equals zero if the percentage for every criterion equals zero, that is, if the person finding detected is negative or entirely incorrect. Conversely, K equals 1 if the person finding detected is ideally correct:

$$K = 0 \iff p_{K_i} = 0 \ \text{for all } i, \qquad K = 1 \iff p_{K_i} = 1 \ \text{for all } i$$

In practice, however, the ideal state where K equals 1 is very difficult to achieve. This is because the percentage of meeting the i-th criterion, $p_{K_i}$, is below 100% of the ideal state for each person detection quality assessment criterion.
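A small numerical sketch of the weighted assessment follows. The five weights and the fulfilment fractions are invented for illustration and are not the values derived from Table 3; the function name is likewise an assumption.

```python
def weighted_quality(weights, fulfilment):
    """Weighted detection quality K as the sum of weight times fulfilment fraction."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("standardized weights must sum to 1")
    if len(weights) != len(fulfilment):
        raise ValueError("one fulfilment value is needed per criterion")
    return sum(v * p for v, p in zip(weights, fulfilment))


# Hypothetical standardized weights of five criteria.
example_weights = [0.40, 0.25, 0.15, 0.12, 0.08]
# Fraction (0 to 1) to which each criterion was met in one frame.
example_fulfilment = [1.0, 0.8, 0.5, 0.0, 1.0]

k_frame = weighted_quality(example_weights, example_fulfilment)
k_worst = weighted_quality(example_weights, [0.0] * 5)
k_ideal = weighted_quality(example_weights, [1.0] * 5)
```

With these inputs the frame rating is 0.40 + 0.20 + 0.075 + 0 + 0.08 = 0.755, and the two boundary cases give 0 and 1 respectively.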
The example in Figure 26 shows the distribution and importance of the individual weights of the weighted detection quality rating in the final rating (K). As is obvious from the example of the results of the average weighted rating of Method 1.1 shown in Table 4, this functional configuration achieved the best rating with a lifespan of the person found of 12 frames (i.e., a lifespan of 3 s at 4 fps), with an average measurement error of 7.5%. The results shown in Table 4 are plotted in the attached figures (Figures 27a, 28b and 29a).
Many methods, as illustrated in the resulting graphs in Figures 27-47, had a significantly increased error rate in the first and last section. This is caused by the inconsistent presence of detectable persons in the room being checked and by their erroneous assessment, due either to the background creation delay or to the delay caused by an excessively long lifespan of the persons detected.

Evaluation of Method 1.1
By comparing the results of Method 1.1, evaluated at lifespans of 1, 12 and 40 frames, it is ascertained that the best detection results were achieved within video sections 2 to 5. A graphical representation of the evaluation of the configurations used is shown in Figure 27 for a lifespan of 1 frame, in Figure 28 for a lifespan of 12 frames, and in Figure 29 for a lifespan of 40 frames. In subfigure (b), the selected figures show boxplots of the statistical distribution of the weighted evaluation in selected sections into quartiles. The final summary of the results of all the methods in Table 5 includes the weighted average evaluation results for all the sections of the video, wherein the best weighted and scaled scores were achieved with Method 1.1 having a lifespan of 12 frames.

Evaluation of Method 1.2
By comparing the results of Method 1.2, evaluated at lifespans of 1, 12 and 40 frames, it is ascertained that the best detection results were achieved within video sections 2 to 4. A graphical representation of the evaluation of the configurations used is shown in Figure 30 for a lifespan of 1 frame, in Figure 31 for a lifespan of 12 frames, and in Figure 32 for a lifespan of 40 frames. In subfigure (b), the selected figures show boxplots of the statistical distribution of the weighted evaluation in selected sections into quartiles. The final summary of the results of all the methods in Table 5 includes the weighted average evaluation results for all the sections of the video, wherein the best weighted and scaled scores were achieved with Method 1.2 having a lifespan of 12 frames.

Evaluation of Method 1.3
By comparing the results of Method 1.3, evaluated at lifespans of 1, 12 and 40 frames, it is ascertained that the best detection results were achieved within video sections 2 to 4. A graphical representation of the evaluation of the configurations used is shown in Figure 33 for a lifespan of 1 frame, in Figure 34 for a lifespan of 12 frames, and in Figure 35 for a lifespan of 40 frames. In subfigure (b), the selected figures show boxplots of the statistical distribution of the weighted evaluation in selected sections into quartiles. The final summary of the results of all the methods in Table 5 includes the weighted average evaluation results for all the sections of the video, wherein the best weighted scores were achieved with Method 1.3 having a lifespan of 12 frames; the best scaled scores were achieved with a lifespan of 40 frames.

Evaluation of Method 2.1
By comparing the results of Method 2.1, evaluated at lifespans of 1, 12 and 40 frames, it is ascertained that increasing the lifespan increased the stability of the results of the weighted detection assessment in the individual sections. A graphical representation of the evaluation of the configurations used is shown in Figure 36 for a lifespan of 1 frame, in Figure 37 for a lifespan of 12 frames, and in Figure 38 for a lifespan of 40 frames. In subfigure (b), the selected figures show boxplots of the statistical distribution of the weighted evaluation in selected sections into quartiles. The final summary of the results of all the methods in Table 5 includes the weighted average evaluation results for all the sections of the video, wherein the best weighted scores were achieved with Method 2.1 having a lifespan of 12 frames; the best scaled scores were achieved with a lifespan of 40 frames. As is obvious from the results in both Figure 36 and Table 5, Method 2.1 achieves very poor results with a lifespan of findings lasting 1 frame.

Evaluation of Method 2.2
The final summary of the results of all the methods in Table 5 includes the weighted average evaluation results for all the sections of the video, wherein the best weighted scores were achieved with Method 2.2 having a lifespan of 12 frames; the best scaled scores were achieved with a lifespan of 40 frames. As is obvious from the results in both Figure 39 and Table 5, Method 2.2 achieves very poor results with a lifespan of findings lasting 1 frame.

Evaluation of Method 2.3
The final summary of the results of all the methods in Table 5 includes the weighted average evaluation results for all the sections of the video, wherein the best weighted and scaled scores were achieved with Method 2.3 having a lifespan of 12 frames. As is obvious from comparing the results in Figures 43 and 44, Method 2.3 achieves the best score when applying a lifespan of findings of 12 frames. The error rate of the assessment increases as the lifespan extends.

Evaluation of Method 3
By comparing the results of Method 3, evaluated at lifespans of 1, 12 and 40 frames, it is ascertained that increasing the lifespan increased the stability of the results of the weighted detection assessment in the individual sections. A graphical representation of the evaluation of the configurations used is shown in Figure 45 for a lifespan of 1 frame, in Figure 46 for a lifespan of 12 frames, and in Figure 47 for a lifespan of 40 frames. In subfigure (b), the selected figures show boxplots of the statistical distribution of the weighted evaluation in selected sections into quartiles. The final summary of the results of all the methods in Table 5 includes the weighted average evaluation results for all the sections of the video, wherein the best weighted scores were achieved with Method 3 having a lifespan of 12 frames; the best scaled scores were achieved with a lifespan of 40 frames.

Mean Error Calculation
The frame-averaged ratings of the sections, shown in column (a) of Figures 27-47, are plotted together with error bars of size $\Delta_i$ in both the positive and negative direction, where $\Delta_i$ is the mean error of the i-th section of the video. $\Delta_i$ is calculated as the arithmetic mean of $\Delta_k$, where k is the index of the frames belonging to the i-th section, and $\Delta_k$ is the absolute value of the difference between $K_i$ and $K_k$:

$$\Delta_i = \frac{1}{n} \sum_{k=1}^{n} \Delta_k, \qquad \Delta_k = \left| K_i - K_k \right|$$

where i is the section index, k is the index of the frame being dealt with, n is the number of frames in the section, $K_k$ is the weighted rating of the selected frame and $K_i$ is the weighted rating of the selected section.
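The mean error of a section can be computed as sketched below. The sketch assumes that the section rating is the arithmetic mean of the frame ratings in that section, which matches the "frame average assessment" wording but is not stated as a formula in the text; the function name and the sample ratings are hypothetical.

```python
def section_mean_error(frame_ratings):
    """Return (section rating, mean absolute deviation of the frame ratings)."""
    n = len(frame_ratings)
    # Assumption: the section rating is the average of its frame ratings.
    section_rating = sum(frame_ratings) / n
    mean_error = sum(abs(section_rating - k) for k in frame_ratings) / n
    return section_rating, mean_error


# Hypothetical weighted ratings K of four frames in one video section.
ratings = [0.2, 0.4, 0.6, 0.8]
section_rating, mean_error = section_mean_error(ratings)
```

For these four frames the section rating is 0.5 and the absolute deviations are 0.3, 0.1, 0.1 and 0.3, giving a mean error of 0.2, which would be drawn as the error bar above and below the section average.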

Assessment of the Best Working Method for the Test Video Scene
Since the assessment of the results in the sections using the Saaty method is not a sufficiently transparent assessment for expressing the incorrect or correct functionality of the detection algorithm, all the results of the weighted assessment are further evaluated by the so-called scaled scores.
The limit values dividing the range of the weighted quality rating value (K) into the following intervals are:

• Lower limits of the correct finding (Q1).
• Upper limits of the incorrect finding (Q2).
Each of the assessment intervals is assigned a mark γ (the Greek letter gamma), which is used to calculate the average detection quality of the results thus obtained. The result is a rigorous detection quality assessment that provides better information than the previous weighted detection quality assessment about the success or failure of person detection in intelligent buildings under the system configurations applied.
Thus, the rating (L) of a section of the test video containing k assessed frames is the arithmetic mean of the marks of the person detection results of the individual frames at the given configuration of the system for detecting persons in intelligent buildings:

$$L = \frac{1}{k} \sum_{j=1}^{k} \gamma_j$$

If the average rating (L) is positive or near 1, this is a good detection. If the average rating (L) is negative or close to 0, this is a poor detection.

The top-rated configurations, taken from the person detection results of the individual methods applied at different lifespans of person findings (Table 5), are listed in Table 6 as the most successful system configurations for detecting persons. In this table, the system configurations are ranked from the top-rated method to the bottom-rated method according to three ways of evaluating the results of the scaled scores, designated as (A), (B) and (C). Score (A) ranks the results according to the average scaled scores in all sections of the video; score (B) ranks the results according to the sum of the rankings in all sections of the video; score (C) ranks the results according to the ranking in all sections of the video except for the first and last section. The ranking under score (C) is more representative in terms of detection quality assessment because it omits the results from the first and last section, in which the detection error rate is unevenly increased in all the methods due to the inconsistent presence of persons in the frames of these sections and, therefore, the absence of verification patterns. The best-rated method achieves mark 1; the less successful methods receive lower marks.

Table 6. Table ranking
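The scaled scoring of a video section can be sketched as follows. The limit values Q1 and Q2 and the marks +1, 0 and -1 are assumptions chosen only for illustration, since the article does not list the values used; the function names are hypothetical as well.

```python
def frame_mark(k_value, q1, q2):
    """Map a weighted rating K to a mark using the two limit values.

    Assumed marks: +1 at or above the lower limit of a correct finding (Q1),
    -1 at or below the upper limit of an incorrect finding (Q2), 0 in between.
    """
    if k_value >= q1:
        return 1
    if k_value <= q2:
        return -1
    return 0


def section_scaled_score(frame_ratings, q1=0.5, q2=0.2):
    """Scaled score of a section: arithmetic mean of the marks of its frames."""
    marks = [frame_mark(k, q1, q2) for k in frame_ratings]
    return sum(marks) / len(marks)


# Hypothetical weighted ratings K of four frames in one section.
scaled_score = section_scaled_score([0.7, 0.6, 0.3, 0.1])
```

With these assumed limits the four frames receive the marks +1, +1, 0 and -1, so the section score is 0.25. Under this choice of marks a negative section score is possible, which is consistent with the negative scaled rating reported for the weakest configuration.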

Conclusions
The aim of the research was to analyze and evaluate standard and original methods for detecting the dynamic movement of persons, where a person is perceived as an object moving in the scene being captured. The solution described emphasizes the suitability of each method for providing data for the subsequent finding of a potential person in the scene. Each of the original background creation solutions analyzed, and each subsequent detection, has its own specifics that must be taken into account when building on the research results achieved.
Appropriately weighted criteria, listed in Table 3, are important for the correct evaluation of the results with respect to the system requirements. The results in the final evaluation in Table 6, which include the best results of each of the methods from Table 5, are ranked not only according to the weighted and scaled rating over the entire range of the test video sections checked (Table 6, columns (A) and (B)), but also according to the scaled rating over the video sections excluding the first and last section, which caused significant detection errors in the rating (Table 6, column (C)). Specific methods increased errors either by detecting redundant objects in the sections or by failing to detect persons who, according to the verification data based on the number and possible occurrence of persons in Table 2, do occur in the scene being captured.
The top-rated results in Table 6 show that the best configuration is Method 1.1 with a lifespan of 12 frames, which achieved the best weighted score (K = 0.487); the worst rated method (Method 3) scored K = 0.328. The best scaled ratings were achieved by Method 2.2 with a lifespan of 40 frames (L = 0.789) and by the aforementioned Method 1.1 with a lifespan of 12 frames (L = 0.78); the worst scaled rating belongs to Method 3 with a lifespan of 1 frame (L = −0.015). The attached weighted rating graphs (Figures 27-47) make clear the influence and importance of choosing an appropriate lifespan of person findings, which bridges the momentary "loss" of a person found in the detection of the current change in the image. Method 1.1 uses the difference between the current image and a static background created from the frame taken at the beginning of the test video. The evaluation results (Figures 27-29) contain errors that confirm the theoretical behaviour of this detection method. These errors include longer-lasting relocation of objects in the scene, which are then mistakenly considered to be a person found; this error could be resolved by occasionally restarting the detection system, thus creating a new static background.
In this best-rated configuration, the lifespan is set to 12 frames; at the test video frame rate of 4 fps, the lifespan of a person found is therefore 3 seconds. Such a system setup is well suited to scenes in which people are moving rather than remaining static.
The lowest scaled rating was achieved with Method 3, which uses only memoryless optical flow. According to the results in Figures 45-47, this method achieved its best detection results in section 7 of Table 2. The method provides information about the detected person only in pixels that move between the frames tested in the scene. The error rate therefore increased significantly in the test sections where there was no movement, or only slight movement insufficient for detecting the person. This increase was compensated by the lifespan factor of the findings that had been detected correctly before static conditions set in.
The other methods described in Section 2, using both static and dynamic backgrounds, achieved results confirming the theoretical assumptions. A suitably selected lifespan parameter of the persons detected played an important role in their application. As is obvious from Figures 27-47, which plot the results of the individual section assessments at the given system configurations, the average rating with its error bars does not always correspond to the statistical distribution of the individual statistical sets, that is, the video sections. Outlying values are also included in the average in order to make the evaluation as strict as possible.