End-to-end systems are an appealing strategy for system design in machine learning research, because they make fewer assumptions about how the system works internally. For our table tennis vision setup, an end-to-end system would receive the input images from all the cameras and output the corresponding ball location in 3D Cartesian coordinates. However, such an end-to-end solution would have a number of disadvantages for our table tennis setup. For example, adding new cameras or moving the existing cameras would require re-training the entire system from scratch.
We divide our vision system into two subsystems: the object detection subsystem, which outputs the ball position in pixel space for each image, and the position estimation subsystem, which outputs a single 3D position of the ball based on a camera calibration procedure. To add new cameras, we only need to run the calibration procedure, and moving existing cameras requires only the re-calibration of the moved cameras.
First, we discuss different methods used in the machine learning community to detect general objects in images. In particular, we discuss object detection and semantic segmentation methods, and their advantages and disadvantages for the ball detection problem. We show that although both methods can successfully find table tennis balls in an image, the semantic segmentation approach can be used with smaller models, achieving the real-time execution we require for robot table tennis.
Subsequently, we discuss how to estimate a single 3D ball position from multiple camera observations. We focus particularly on how to deal with erroneous estimates of the ball position in pixel space, for example, when the object detection method fails and reports the location of some other object. We analyze the algorithmic complexity of the proposed methods and also provide execution times on a particular computer for setups with different numbers of cameras.
2.1. Finding the Position of the Ball in an Image
The problem of detecting the location of desired objects in images has been well studied in the computer vision community [18]. Finding bounding boxes for objects in images is known as object detection. In [19], a method called Single Shot Detection (SSD) was proposed to turn a convolutional neural network for image classification into an object detection network. An important design goal of the SSD method is computational efficiency. In combination with a relatively small deep network architecture like MobileNet [20], which is designed for mobile devices, it can perform real-time object detection for some applications.
Example predictions of a MobileNet architecture trained with the SSD method on a ball detection dataset are shown in the corresponding figure. Each picture shows a section of the image with the corresponding bounding box prediction. The resulting average processing speed using an NVidia GTX 1080 GPU was 60.2 frames per second on 200 × 200 pixel resolution images. For a four-camera robot table tennis setup, this would result in about 15 ball observations per second. Unfortunately, for a high speed game like table tennis, a significantly higher number of ball observations is necessary. However, we consider it important to mention the results we obtained with fast deep learning object detection techniques like the SSD method, because they could be used with our method in a different application where the objects to track are more complex and the required processing speeds are lower.
An alternative approach to finding objects in images is to use a semantic segmentation method, where the output of the network is a pixelwise classification into objects of interest or background. For example, [21] uses deep convolutional neural networks to classify every pixel in a street scene into one of 20 categories such as car, person, and road. To track the ball we only need two categories: ball and background. We consider background anything that is not a table tennis ball. Let us denote the resulting probability image as a matrix P, where each entry P(u, v) is a scalar denoting the probability that the pixel (u, v) of the original image corresponds to a ball pixel.
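As a rough illustration of how such a probability image could be produced, the sketch below applies a single convolutional unit (one 5 × 5 filter followed by a sigmoid) to a single-channel image. This is only a minimal sketch: the filter weights `w` and bias `b` are placeholders (the real model learns them from data, and operates on camera images rather than a toy single-channel array).

```python
import numpy as np

def probability_image(img, w, b):
    """Apply one 5x5 convolutional unit with a sigmoid activation to a
    single-channel image, producing a per-pixel ball probability map P.
    w (5x5 filter) and b (bias) are placeholders for trained parameters."""
    H, W = img.shape
    k = w.shape[0]                      # filter size (5)
    r = k // 2
    padded = np.pad(img, r, mode="constant")
    P = np.empty((H, W))
    for u in range(H):
        for v in range(W):
            # Correlate the filter with the local window, then squash to (0, 1)
            z = np.sum(padded[u:u + k, v:v + k] * w) + b
            P[u, v] = 1.0 / (1.0 + np.exp(-z))
    return P
```

Because the model is a single convolutional unit, every output pixel depends only on a small local window, which is what makes this approach so cheap compared to a full detection network.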
In order to find the actual set of pixels corresponding to the ball, we need some kind of threshold-based algorithm that makes a hard zero/one decision about which pixels belong to the object of interest based on the obtained probabilities. We use a simple algorithm that finds the pixel position with maximum probability and then a region of neighboring pixels with a probability higher than a given threshold.
Algorithm 1 shows the procedure to obtain the set of pixels corresponding to the ball from the probability image P. The procedure receives two threshold values τ_high and τ_low, which we call the high and low thresholds, respectively. In Line 1, we find the pixel position with maximum probability in the probability image P. If the maximum probability is lower than the high threshold τ_high, we consider that there is no ball in the image and return an empty set of pixels. Otherwise, Lines 5 to 15 find a region O of neighboring pixels around the maximum with a probability larger than the low threshold τ_low using a Breadth First Search algorithm. The center of the ball is computed by averaging the pixel positions in O.
|Algorithm 1 Finding the set of pixels of an object.|
|Input: A probability image P, and high and low thresholds τ_high and τ_low.|
Output: A set of object pixels O
1: (u*, v*) ← argmax_(u,v) P(u, v)
2: if P(u*, v*) < τ_high then
3:   return O ← ∅
4: end if
5: O ← {(u*, v*)}
6: visited ← {(u*, v*)}
7: q ← a queue containing (u*, v*)
8: while q is not empty do
9:   (u, v) ← pop(q)
10:   for each (u′, v′) in neighbors of (u, v) do
11:     if (u′, v′) not in visited and P(u′, v′) > τ_low then
12:       add (u′, v′) to O, visited, and q
13:     end if
14:   end for
15: end while
16: return O
The computational complexity of Algorithm 1 is linear in the number of pixels. If N represents the total number of pixels in the image and B the number of pixels of the object to track, the computational complexity of Line 1 alone is O(N) and the complexity of the rest of the algorithm is O(B). However, Line 1 can be efficiently implemented on a GPU, whereas the rest of the algorithm is harder to implement on a GPU due to its sequential nature. Given that B ≪ N, we decided to use the GPU to execute Line 1 and implemented the rest of the algorithm on the CPU. In combination with the semantic segmentation approach using a single convolutional unit, we obtained a throughput about 50 times higher than that of the SSD method for our ball tracking problem.
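The thresholded region growing described above can be sketched in a few lines of Python. This is a minimal CPU-only sketch of Algorithm 1 using 4-connected neighbors; the threshold values are illustrative defaults, not the ones used in the actual system.

```python
from collections import deque
import numpy as np

def find_ball_pixels(P, t_high=0.9, t_low=0.5):
    """Sketch of Algorithm 1: return the set of ball pixels in the
    probability image P, or an empty set if no ball is detected."""
    # Line 1: pixel with maximum probability (run on the GPU in practice)
    u, v = np.unravel_index(np.argmax(P), P.shape)
    if P[u, v] < t_high:                     # no ball in the image
        return set()
    O = {(u, v)}
    q = deque([(u, v)])
    while q:                                 # BFS over 4-connected neighbors
        cu, cv = q.popleft()
        for nu, nv in ((cu - 1, cv), (cu + 1, cv), (cu, cv - 1), (cu, cv + 1)):
            if (0 <= nu < P.shape[0] and 0 <= nv < P.shape[1]
                    and (nu, nv) not in O and P[nu, nv] > t_low):
                O.add((nu, nv))
                q.append((nu, nv))
    return O
```

The ball center would then be the mean of the pixel coordinates in the returned set.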
The corresponding figure shows the semantic segmentation results for the table tennis problem using a single convolutional unit with a 5 × 5 pixel filter size. The picture on the left shows a section of the image captured with our cameras. The picture in the center shows the probability image P assigned by the model to each pixel of being the ball, where white means high probability and black low probability. The picture on the right shows a bounding box that contains all pixels in O returned by Algorithm 1. Note that all the objects in the scene that are not the ball are assigned a very low probability of being the ball by the model, and most of the pixels of the ball are assigned a high probability. In fact, the only object that is still visible (not completely dark) in the probability image is the human arm, because its color is more similar to the ball's than the rest of the scene.
The throughput of the single 5 × 5 convolutional unit is about 50 times higher than the throughput of the SSD method on the same hardware with our implementations. As a result, we decided to use the single convolutional unit as the ball detection method, achieving the necessary ball observation frequency and accuracy for robot table tennis. In Section 3, we analyze in detail the performance and accuracy of the single convolutional unit. In addition, we compare the accuracy of our entire proposed system with the RTBlob vision system [8] and evaluate the playing performance of an existing robot table tennis method [9] using the proposed system.
2.2. Robust Estimation of the Ball Position
Once we have the position of the ball in pixel space for multiple calibrated cameras, we proceed to estimate a single reliable 3D ball position. The process of obtaining an estimate of the 3D position of an object given its pixel space position in two or more cameras is called stereo vision. For an overview of stereo vision, refer to [22].
Previous work on vision systems for robot table tennis focused on providing real-time 3D ball positions without considering the possibility of errors in the ball detection algorithm. As a result, robot table tennis systems like [5] had to include outlier detection techniques on the 3D observations using, for example, physics models. When an observation deviated significantly from the position predicted by the physics model, the observation was discarded. The main drawback of this approach is that a complete 3D observation must be discarded even if only one camera made a mistake localizing the ball in pixel space and the rest of the cameras provided a correct observation. Adding more cameras to such a system would end up discarding more ball observations instead of improving the reliability of the entire system. Instead, our approach consists of detecting outliers in the 2D pixel space ball positions individually for each camera. If a small set of cameras detects the ball position incorrectly, but a different set of cameras detects the ball correctly, we can still provide a 3D observation using only the correct set of cameras.
Outlier detection algorithms have received a lot of attention from the scientific community. The most common approaches consist of detecting atypical values in an unsupervised or supervised fashion [23], modeling a distribution of the typical values in the former case or treating outlier detection as a classification problem in the latter. These kinds of approaches are not useful in our case, since the distribution of the ball position in image space is not expected to differ from the position of other objects that could be mistaken for the ball. Instead, we focus on outlier detection by consensus [24]. Examples of algorithms for outlier detection by consensus include RANSAC [25], MLESAC [26], NAPSAC [27] and USAC [28]. All these algorithms consist of taking a random subsample of the observation set, fitting the parameters of the model of interest only on the subsample set, and keeping the best model parameters according to some optimality criterion. The differences between these algorithms lie in the criteria used to select the subsets of observations and to decide which is the best model found so far.
The RANSAC (Random Sample Consensus) algorithm [25] is probably the most popular method of outlier detection by consensus used in the computer vision community [24]. The subsamples are typically taken uniformly at random. A model hypothesis is generated by fitting the model to the random subsample. An observation is considered consistent with the hypothesis if its error is smaller than a parameter ε, and the hypothesis with the largest number of consistent observations is selected as the best one. The MLESAC [26] algorithm maximizes the likelihood of the inlier and outlier sets under a probabilistic model instead of maximizing the size of the consistent set. The NAPSAC [27] algorithm uses the observation that inliers are often closer to each other than outliers. Instead of selecting the subsample set uniformly at random, the NAPSAC algorithm generates a subsample set taking into consideration the distance between observations.
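To make the consensus idea concrete, the sketch below shows plain RANSAC on a toy problem (fitting a line y = ax + b to 2D points): sample a minimal subset, fit a hypothesis, and keep the one with the most consistent observations. The line-fitting setting is only illustrative; it is not part of our vision system.

```python
import random

def ransac_line(points, iters=200, eps=0.1, seed=0):
    """Fit y = a*x + b by RANSAC: draw random minimal subsamples (two
    points), generate a hypothesis, and keep the hypothesis with the
    largest set of consistent observations (error below eps)."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        p1, p2 = rng.sample(points, 2)      # minimal random subsample
        if p1[0] == p2[0]:                  # vertical pair, skip
            continue
        a = (p2[1] - p1[1]) / (p2[0] - p1[0])
        b = p1[1] - a * p1[0]
        # Observations consistent with the hypothesis (the support set)
        inliers = [p for p in points if abs(a * p[0] + b - p[1]) < eps]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers
```

A final refit of the model on the returned inlier set is a common follow-up step.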
We propose a consensus-based algorithm that maximizes the size of the support set like RANSAC does, but instead of selecting the subset of observations at random, we try all possible camera pairs, guaranteeing to find a solution if it exists. In the rest of this section, we explain in detail our consensus-based algorithm to find a single 3D position from several 2D pixel space observations containing outliers. We assume we have access to two functions project and stereo available from a stereo vision library, as well as the projection matrices M_i for each camera i. Given a 3D point X, the function project(M_i, X) returns the pixel space coordinates x_i of the projection of X on the image plane of camera i. For the stereo vision method, we are given a set of pixel space points x_1, …, x_n from n different cameras and their corresponding projection matrices M_1, …, M_n, and obtain an estimate of the 3D point X. Intuitively, the function stereo finds the point X that minimizes the pixel re-projection error given by

E(X) = Σ_{i=1}^{n} dist(project(M_i, X), x_i),

where dist is some distance metric like the Euclidean distance.
We assume that, from a set S of pixel space ball observations reported by the vision system, some observations are correctly reported ball positions and the rest are erroneously reported ball positions. We call S_in ⊆ S the inlier or support set and S_out = S \ S_in the outlier set. We would like to find the 3D ball position X that minimizes the re-projection error E(X) using only the support set S_in.
We define a set of pixel space observations as consistent if there is a 3D point X such that dist(project(M_i, X), x_i) < ε for every observation x_i in the set, where ε is a pixel space error tolerance. We estimate S_in by computing the largest subset of S that is consistent. The underlying assumption is that it should be hard to find a single 3D position that explains a set of pixel observations containing outliers. On the other hand, if the set of observations contains only inliers, we know it should be possible to find a single 3D position X, the Cartesian position of the ball, that explains all the pixel space observations.
Algorithm 2 shows the procedure we use to obtain the largest consistent set of observations. Note that we need at least two cameras to estimate a 3D position. Our procedure consists of trying all pairs of cameras (i, j), estimating a candidate 3D position using only those two observations, and subsequently counting how many cameras are consistent with the estimated candidate position. If c represents the number of cameras reporting a ball observation, the computational complexity of this algorithm is O(c³), since there are O(c²) camera pairs and each consistency check iterates over all c cameras.
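The pair-enumeration procedure above can be sketched as follows. This is a minimal sketch, not our actual implementation: project and stereo are stand-ins for the stereo vision library functions described earlier (here implemented with pinhole projection and linear DLT triangulation), and the camera matrices in the usage example are made up for illustration.

```python
import itertools
import numpy as np

def project(M, X):
    """Pixel coordinates of a 3D point X under a 3x4 projection matrix M."""
    x = M @ np.append(X, 1.0)
    return x[:2] / x[2]

def stereo(pts, mats):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    A = []
    for (u, v), M in zip(pts, mats):
        A.append(u * M[2] - M[0])
        A.append(v * M[2] - M[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                      # null vector = homogeneous 3D point
    return X[:3] / X[3]

def largest_consistent_set(obs, mats, eps):
    """Sketch of Algorithm 2: try every camera pair, triangulate a
    candidate 3D point, and keep the candidate consistent with the most
    cameras (re-projection error below eps pixels)."""
    best = set()
    c = len(obs)
    for i, j in itertools.combinations(range(c), 2):
        X = stereo([obs[i], obs[j]], [mats[i], mats[j]])
        support = {k for k in range(c)
                   if np.linalg.norm(project(mats[k], X) - obs[k]) < eps}
        if len(support) > len(best):
            best = support
    return best
```

The final 3D estimate is then obtained by calling stereo once more on the returned support set.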
For a vision system with fewer than 30 cameras, we obtained real-time performance even with a sequential implementation of Algorithm 2. Nevertheless, Algorithm 2 is easy to parallelize: the two outermost for loops can be run independently in parallel. In Section 3, we evaluate the real-time performance and accuracy of the 3D estimation, simulating scenarios with different numbers of cameras and outlier probabilities. Unlike previous robot table tennis systems that discard entire 3D observations using physics-based models [5], we obtain improved accuracy and fewer dropped observations as the number of cameras in the vision system increases. We also evaluate the error on the real system and compare it with the RTBlob method using the same experimental setup with four cameras, obtaining higher accuracy and reliability with the vision system proposed in this paper.
|Algorithm 2 Remove outliers by finding the largest consistent subset of 2D observations for stereo vision.|
|Input: A set of 2D observations x_i with their camera matrices M_i, and a pixel error tolerance ε.|
Output: A subset of maximal size without outliers.