The spread of the Industry 4.0 paradigm [1] has led the manufacturing industry to continuously improve as a result of advances in technology. Companies have sought virtualization and decentralization through flexibility, transparency, and integration, which are some of the Industry 4.0 design principles [2]. In order to become flexible and integrated, firms collect data to feed digital copies of everything into a Cyber Physical System (CPS) [3]. Data collection is also beneficial for implementation of the Lean Manufacturing (LM) paradigm, which, together with the Industry 4.0 paradigm, has been listed by many researchers as management theories and practices with reciprocal synergies [5]. It is usually easy to gather data using one or more contemporary technologies in highly standardized contexts. However, some complex environments still exist, especially those in which human activity prevails, where only very advanced and customizable technologies are smart enough to perform appropriate monitoring tasks. Among these smart technologies, the Artificial Intelligence and Computer Vision (CV) branches have been gaining relevance in recent years, due to their contributions to Intelligent Manufacturing Systems [7]. In these scenarios, various CV techniques have been tested successfully by many academics [8]. In addition, some examples of Visual Computing technologies, such as CV, Augmented Reality, and Virtual Reality, targeted at empowering and supporting the operators of smart factories [10], have been proposed for monitoring movements and coherently managing co-bot actions, in order to ease the training of new staff or to help in the assembly check of particularly small electronic components [12]. According to Coffey [13], the global market of Machine Vision (MV), which is the application of Computer Vision to the industrial domain, will grow from US$8.54 billion in 2017 to US$16.89 billion in 2026. The success of MV can be attributed to its characteristics: it is non-contact, reliable, safe, suitable for harsh environments, and designed for long operating periods, among other aspects. The economic value of MV runs parallel with the examples of its practical applications [14], such as object measurement, object location, image recognition, object existence detection, and object defect detection [15]. Among the variety of manufacturing-related topics related to MV technologies, we find flexible manufacturing [16]. Nevertheless, the vast majority of studies have focused on use-cases in which human interaction is absent and the objects to be monitored are either stationary or moved by a conveyor [19]. The objects are sometimes moved manually, but the MV algorithm exploits manual triggering for taking snapshots of the components to be inspected once the operator has placed them in the correct position. However, none of the works mentioned above have described how to concretely manage human interactions inside a framed area.
This research work stems from the need of an Italian manufacturing firm for timely and reliable data about its manual assembly process, specifically the number of pieces assembled by each of the stations of ten assembly lines on the shop floor. With managerial involvement in Lean Manufacturing and Industry 4.0, the tendency is to use a kanban system to manage the material replenishment of assembly lines [20], in order to keep inventories low in the production lines and in the factory in general, with the aim of reducing costs and investments locked into raw material and spare parts. This forces managers and logistics operators to have updated and reliable information on the production of assembly lines, in order to align, in real-time, all the upstream and downstream processes in the pull production flow, such as material replenishment, procurement, shipping, and so on. Moreover, as the firm is a high-mix low-volume company, the assembly lines are usually flexible and re-configurable: every single line is in charge of the production of several product variants. The assembly sequence is composed of customer orders and frequently involves passing from one product type to another, which implies assembling some components instead of others. The availability of an incremental, reliable, and timely individual count of assembled pieces would help operators in avoiding errors due to miscounted assembled products. In this way, assembly operators would be supported [22] in managing the code change, based on the current customer order. In contexts where products are moved by conveyor belts, counting through an MV system might be trivial [24]; however, in our scenario, assembly can only be done manually, thus making it tricky to count pieces passed from one operator to the next. Completed pieces, which are ready to be picked up by the following operator, are placed on a small intermediate table.
This paper improves upon our previous work [26], in which we designed a Machine Vision system able to count pieces that an operator passes to the following one by placing them on an intermediate table. The previous solution was able to count the assembled pieces and partially manage the human interactions through color-based discrimination. However, we reached unsatisfactory performances, precisely caused by the unpredictable and non-standardized interactions of humans in the phases of placing and picking up the pieces. In this work, we propose a new solution which is able to analyze a video stream with pieces manually moved by an upstream assembly operator, using a preliminary procedure which decides whether or not to further process a frame with a suitable counting algorithm. This preliminary inter-frame difference check procedure, called Motion Check, was developed to avoid processing critical frames which might cause counting errors. From the beginning, we have aimed to be as flexible as possible, envisioning the scalability of the system at a plant-wide level, where the product mix produced is extremely high (i.e., thousands of different products), when compared to the mix produced in one line only (i.e., on the order of ten product variations). For this reason, in the proposed algorithm, we exploited some basic techniques, such as image pre-processing, binarization, morphological operations [27], and blob detection. In this work, we propose the use of the Motion Check method followed by the blob-based counting algorithm to build a simple, computationally fast, and yet very versatile solution which is able to count pieces, mostly based on color and morphology. We avoided testing very advanced MV solutions based on Convolutional Neural Networks (CNN) for the counting algorithm, as they are too computationally intensive [28] and, given the mandatory necessity for our system to work in real-time, they would likely require huge computational resources, such as dedicated GPUs. Furthermore, the usage of a CNN is probably excessive in our context, as the environment is very restricted, and only hands and products interfere in the framed area. This is the reason why very advanced object detectors, such as Deep Learning-based methods, were deemed excessive in comparison to the task at hand. We then compared the proposed blob-based counting algorithm to a Machine Learning-based one, an Aggregated Channel Features (ACF) detector [30] which was specifically trained to detect the product types produced on the assembly line where we tested both MV algorithms. The ACF custom object detector was taken as a reference, in terms of accuracy of detection and counting, but we must specify that it performed well only in the restricted context of that assembly line, failing if we extend its application to other assembly lines where the product shapes and dimensions differ from those of the training set used. On the other hand, the developed blob-based solution is not product type-dependent and works in a manner that allows its plant-wide application (i.e., for other assembly lines also).
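To make the flavor of such a lightweight pipeline concrete, the sketch below binarizes a grayscale frame, removes isolated noise pixels as a crude stand-in for a morphological opening, and counts connected components (blobs) above a minimum area. The thresholds, and the assumption that pieces appear dark against a light table, are hypothetical and chosen purely for illustration; they are not the parameters used in our system.

```python
import numpy as np

def count_blobs(gray, threshold=60, min_area=50):
    """Count dark blobs (e.g. partly black pieces) in a grayscale frame.

    Hypothetical parameters: `threshold` separates dark pieces from a
    light table surface; `min_area` discards small noise blobs.
    """
    # Binarization: pixels darker than the threshold belong to a piece.
    mask = gray < threshold

    # Crude stand-in for a morphological opening: keep only pixels that
    # have at least one 4-connected foreground neighbour.
    padded = np.pad(mask, 1)
    neigh = (padded[:-2, 1:-1] | padded[2:, 1:-1] |
             padded[1:-1, :-2] | padded[1:-1, 2:])
    mask &= neigh

    # Connected-component labelling via iterative flood fill (4-connectivity).
    labels = np.zeros(mask.shape, dtype=int)
    current, areas = 0, []
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i, j] and labels[i, j] == 0:
                current += 1
                labels[i, j] = current
                stack, area = [(i, j)], 0
                while stack:
                    y, x = stack.pop()
                    area += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = current
                            stack.append((ny, nx))
                areas.append(area)
    # Only blobs above the minimum area are counted as pieces.
    return sum(a >= min_area for a in areas)
```

In a production setting, library routines (e.g. dedicated morphology and labelling functions) would replace the hand-rolled loops; the point here is only that the whole chain is simple arithmetic on pixel arrays, which is what keeps the approach computationally cheap.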
In this work, we compare the performance of two alternative solutions equipped with Motion Check. This procedure was added as a preliminary step to the blob-based and ACF-detector counting algorithms; its use is essential to guarantee the real-time capability of the overall algorithm and its ability to deal with manual handling. We implemented the two alternative solutions into a prototype application which provides a visual interface. Therefore, the assembly operator can visualize their production in real-time and, being conscious of the order termination, can successfully manage the critical moment of changing the order. In addition, we developed a prototype of human-friendly tool which allows even non-expert operators to easily adjust system parameters to new assembly lines. This kind of tool is essential for ensuring the plant-wide level implementation of our solution; that is, in every station of every assembly line.
The remainder of the paper is organized as follows: the detailed description of the context of use, of the system setup, and of the Motion Check procedure together with the counting algorithms are presented in the Materials and Methods section. The test setting, performance metrics, and test outcomes are detailed in the Results section; while the comparison, reasoning, and the comprehensive picture are featured in the Discussion section. In the Conclusion section, we briefly recap the entire work, and introduce the next steps we envision for the future of the project.
We would like to specify that all the recorded videos and computer code are available, upon request, from the corresponding author. Our aim was to find the best solution for counting pieces assembled by an operator, who places them on the intermediate table as soon as they complete their workload. These pieces are progressively picked up, one by one, by the following operator, who carries on the assembly process. In order to effectively count the pieces manually placed by an assembly operator on a table, we developed a preliminary Motion Check procedure followed by two different counting solutions, whose performances were compared. One counting solution was designed by us, based on existing image processing techniques; the other is a reference method, a consolidated Machine Learning object detector. In order to objectively compare the two counting solutions, the operation managers of the company defined the following requirements that the chosen system must meet, in order to be implemented in the assembly lines of the shop floor:
Count every time an assembled piece is placed on the table;
Do not count whenever a piece is picked up from the table;
Do not count whenever a piece is not placed on the table, in general, given that sometimes operators interfere in the framed ROI of the camera even though they are not placing nor picking up a piece;
Analyse the live video stream for long periods without losing any interval;
Be timely in counting; and
Be adaptable to all of the different assembly lines in the company’s shop floor.
The requirements (a), (b), and (c) are synthesized, from now on, under the name “Counting Capability”. Requirement (d) was named “Real-Time Capability”, requirement (e) “Responsiveness”, and requirement (f) “Versatility”. As a general result, we also briefly summarize the improvements resulting from the installation of the light ring, which is essential for application of the Motion Check procedure. In Figure 11, we show the behavior of the m_t parameter for an old video, recorded before the light ring installation. The green horizontal line corresponds to the 1.08 threshold, and the magenta dots along this line correspond to every video frame which was further analysed.
The variance due to environmental shadows and natural lighting influences on the illumination of the framed table is evident if we compare the m_t values with the timeline of events at the top of the figure. Indeed, the trend is not stable, even within stand-still phases, in contrast with the plot of the m_t parameter for a video recorded after the light ring installation (see Figure 4). The excessive fluctuation of m_t results in a far higher number of frames being further analysed (as indicated by the magenta dots) which, in turn, may affect the real-time and correct-counting capabilities of the entire system. As the light ring stabilizes the working conditions of the camera, there is no need to change the threshold once it is fixed, and the behavior of m_t is not influenced by shadows or natural light changes.
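The gating role of the Motion Check can be sketched as follows. The exact definition of the m_t statistic is given in the Materials and Methods section; here we assume, purely for illustration, a ratio of mean frame intensities between consecutive frames, which sits near 1 when the scene is still and rises when a hand or a piece enters the framed area, compared against the same 1.08 threshold mentioned above.

```python
import numpy as np

def motion_check(prev, curr, threshold=1.08):
    """Decide whether `curr` should be passed on to the counting algorithm.

    Illustrative stand-in for the paper's m_t statistic: the ratio between
    the mean intensities of two consecutive grayscale frames. A stable,
    ring-lit scene keeps the ratio near 1; motion pushes it above the
    threshold, flagging the frame for further analysis.
    """
    mu_prev = float(prev.mean()) + 1e-6  # epsilon avoids division by zero
    mu_curr = float(curr.mean()) + 1e-6
    m_t = max(mu_prev, mu_curr) / min(mu_prev, mu_curr)
    return m_t > threshold, m_t
```

This also makes the effect of the light ring intuitive: without controlled illumination, drifting shadows alone change the mean intensity, pushing m_t over the threshold and triggering needless (and potentially error-prone) frame processing.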
3.1. Testing Counting Capability
To test the “Counting Capability” of the two algorithms as objectively as possible, we decided to acquire videos in MP4 format and apply both algorithms offline to the same videos. In this way, we could be sure that possible imbalances in the performances achieved were only a matter of the Counting Capability of the specific solution, and not caused by particular and unusual behaviors of the operators during one real-time test that might differ from the behaviors analysed during the real-time test of the other algorithm. Specifically, given a video sequence, the processing algorithms could behave in four ways:
correctly count one piece when the operator places an assembled one on the intermediate table (True Positive, TP);
wrongly count when a piece has not been added on the table (False Positive, FP);
correctly do not count when there have not been new pieces placed on the table (True Negative, TN); or
wrongly do not count when an assembled piece has been placed on the table (False Negative, FN).
According to the definition of these four alternative outcomes and the performance evaluation criteria proposed in [33], we can define three metrics that summarize the capability of each of the two solutions in managing the counting task:
Sensitivity, computed as the number of TP divided by the sum of the number of TP and FN, measures the solution’s capability of correctly identifying placed pieces and counting;
Specificity, computed as the number of TN divided by the sum of TN and FP, measures the solution’s capability of correctly identifying the picked up pieces without counting; and
Accuracy, computed as the sum of TP and TN divided by the sum of TP, FP, TN, and FN, measures the overall solution’s capability of correctly behaving.
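The three metrics follow directly from the confusion-matrix counts and can be computed as below; the TP/FP/TN/FN values in the accompanying check are invented for illustration and are not the results reported in our tables.

```python
def counting_metrics(tp, fp, tn, fn):
    """Sensitivity, Specificity, and Accuracy from confusion-matrix counts.

    Sensitivity: ability to count placed pieces (TP / (TP + FN)).
    Specificity: ability to ignore pick-ups and other interactions
                 (TN / (TN + FP)).
    Accuracy:    overall rate of correct decisions
                 ((TP + TN) / (TP + FP + TN + FN)).
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy
```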
In our case, we expect a comparable total frequency of occurrence of Negatives and Positives, given that we consider all the piece pick-ups as Negative samples. In particular, the recorded videos contained 88 piece pick-ups and 90 piece placings, which were unconstrained and completely natural. We specify that they were unconstrained as, in our past work, we partially constrained the manner of picking up and placing objects to improve the previous algorithm’s performance.
To show the improvement resulting from the introduction of the Motion Check, we compared the two alternative solutions provided with the Motion Check procedure (proposed in this paper) to their original versions (presented in our previous work). Thus, we prove the validity of the counting system with Motion Check, compared to that without. Additionally, by comparison with the improved versions, we prove the comparable performance of the simple and versatile Blob algorithm. In Table 1, we present the TP, TN, FP, and FN results regarding the offline test performed on videos collected after the installation of the light ring. Specifically, we present the results achieved by the two alternative algorithms, both in their original versions, where all of the frames were indiscriminately analysed (the flowchart and details can be found in our groundwork [26]), and in their new versions provided with the inter-frame Motion Check, as described in this work.
In Table 2, we report the values of the Sensitivity, Specificity, and Accuracy metrics for the two algorithms, both in their original implementations and in the improved ones introduced in this work.
3.2. Testing Real-Time Capability
The aforementioned test could only provide information about the counting capability of the two alternatives once the video files were collected; however, the system should ensure its potential to analyse a continuous video stream without information loss over long periods, at least sixteen hours, given that the company is organized in two working shifts. To test this, we let the Real-time Counting App run on the Macbook for an entire day. As described before, the prototype App, while continuously processing frames, produces a log file containing timestamps for every count, every frame further analysed, and even delays in frame capturing. If the interval between two consecutive grabbed frames is higher than 0.3 s, which means the algorithm is processing fewer than 3 frames per second, then the log file will list the timestamp of a “Frame Capturing Delay” error. By letting the application work for an entire work day and analysing the related log file, we could count the number of times in which the application was too slow in acquiring frames. We found only 1 occurrence of this kind of error in the log file within sixteen hours of continuous real-time processing using the blob-based algorithm for further processing, and 10 occurrences using the detector-based algorithm; proving that, being more computationally intensive, the latter alternative solution is less well-suited for real-time purposes.
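The delay-logging rule described above reduces to a simple interval check between consecutive frame grabs. The helper below is our own minimal sketch, not the prototype App's actual code; the function name and log format are hypothetical.

```python
MAX_INTERVAL = 0.3  # seconds: an interval above this means < 3 frames per second

def check_capture_delay(prev_ts, curr_ts, log):
    """Append a 'Frame Capturing Delay' entry to the log whenever two
    consecutive frame grabs are more than MAX_INTERVAL seconds apart.

    `prev_ts` and `curr_ts` are the grab timestamps of consecutive
    frames; `log` is any mutable list of (message, timestamp) entries.
    """
    if curr_ts - prev_ts > MAX_INTERVAL:
        log.append(("Frame Capturing Delay", curr_ts))
```

Counting the occurrences of such entries over a full log file then gives exactly the figure used in the comparison above (1 occurrence for the blob-based solution versus 10 for the detector-based one over sixteen hours).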
3.3. Testing Other Requirements
As far as Versatility is concerned, we conducted an ad-hoc experiment using pieces which differ from the usual types assembled in the considered test assembly station. The only similarity between the usual pieces and the ones used in this experiment was the chromatic characterization (they were partly black), while they differed in dimension and shape. The Improved Blob algorithm, with parameters for this new context of use easily and quickly defined using the Setting Tool App, demonstrated performances perfectly comparable with those reached during the Counting Capability Test with the usual product types presented in Section 3.1. On the other hand, the Improved ACF, which was trained on the original product type, made several mistakes, due to its unsuitability for the different objects placed on the table during this test. The adaptation of the detector to this new context is not as easy and fast as the adaptation of the blob-based solution, due to the mandatory necessity of training which, in turn, necessitates the development of a training image set. Therefore, the ACF proved to be unsuitable for plant-wide implementation.
With regards to Responsiveness, we conducted additional live experiments using the Real-time Counting App, which allows the manual recording of the timestamps connected to the piece placing and autonomously saves the timestamps of the algorithm’s counting decisions. In this way, by simply comparing the two vectors of timestamps, we can compute the mean difference between the moment of placing and the moment of counting. In Figure 12, we summarize the results of this test, carried out on 200 pieces.
It can be seen that 94% of pieces were correctly counted within 1 second of placing, which means almost simultaneously, while 3% of pieces suffered a slight delay of around 1 s. This may be due to the operator being very slow in taking their hand out of the framed area. The remaining 2% of pieces were not counted, these errors being due to the only eventuality that has not yet been addressed by our algorithms: when both operators simultaneously intervene in the ROI, one picking up and the other placing a piece, so that the number of objects in the currently analysed frame balances out against that in the previously analysed frame.
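The comparison of the two timestamp vectors can be sketched as follows. The greedy matching of each placing with the first subsequent count, and the 1-second window separating "counted" from "missed", are our assumptions for illustration rather than the App's exact logic.

```python
def responsiveness_stats(placed_ts, counted_ts, window=1.0):
    """Match each placing timestamp with the first subsequent counting
    timestamp and report (mean delay, number counted, number missed).

    Placings with no count within `window` seconds are treated as missed;
    each counting timestamp is consumed by at most one placing.
    """
    delays, missed = [], 0
    counts = sorted(counted_ts)
    for t in sorted(placed_ts):
        match = next((c for c in counts if c >= t), None)
        if match is None or match - t > window:
            missed += 1
        else:
            delays.append(match - t)
            counts.remove(match)
    mean_delay = sum(delays) / len(delays) if delays else 0.0
    return mean_delay, len(delays), missed
```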
5. Conclusions and Future Work
In this work, we proposed a Machine Vision algorithm which is able to analyze a video stream in real-time and automatically count the pieces assembled by an operator and placed on a table in a framed area. The developed algorithm integrates an inter-frame analysis mechanism which handles the human interactions in the framed area that can cause incorrect piece counting. In fact, after the development of a first solution in our preliminary paper, we identified the interaction of the operators with the framed area as a weakness of the previously developed algorithm. In order to overcome this limitation, we introduced the Motion Check phase as a preliminary step before the image processing aimed at counting. This Motion Check is a novel adaptive examiner of motion which does not depend on the specification of a fixed background, and determines whether there have been relevant movements between the current frame and the previous one. Using the Motion Check and exploiting blob detection to identify the objects, the proposed solution was able to reliably count the pieces assembled by an operator. In fact, the proposed solution demonstrated very good performances, in terms of Sensitivity, Specificity, and Accuracy, when tested in a real situation on an Italian manufacturing firm’s shop floor. Moreover, the Real-time Capability, Responsiveness, and Versatility of our solution were evaluated in specific tests.
By analysing the frames corresponding to counting moments, we found that our improved algorithm counts when there are no hands in the framed area. Therefore, if we collect the frames corresponding to counting moments, we automatically obtain a considerable, high-quality data set for training a Machine Learning-based detector, or even a Deep Learning-based one. We envision the possibility of using the basic blob-based solution as a preliminary, automatic way of developing a more robust detector-based solution for every station and for every assembly line. With this insight, we aim to simplify the development of an advanced Machine Vision detector for custom object recognition purposes, even for non-experts. Another point we want to address is that, with the presented configuration of the solution, we analyse between 3 and 10 fps; nonetheless, we would like to improve the frame rate, speeding up computation by implementing the algorithms directly on a dedicated hardware platform provided with an FPGA and GPU. This goes together with a need for system optimization, in order to ensure the correct and real-time analysis of multiple converging video streams. Coherently, source coding and communication protocols have to be optimized and tested for the design of a system architecture which exploits either edge computing or cloud computing to carry out the simultaneous and continuous processing of several parallel video streams, which is the final objective of the company.