Gaussian Multiscale Aggregation Applied to Segmentation in Hand Biometrics

This paper presents an image segmentation algorithm based on Gaussian multiscale aggregation oriented to hand biometric applications. The method is able to isolate the hand from a wide variety of background textures such as carpets, fabric, glass, grass, soil or stones. The evaluation was carried out by using a publicly available synthetic database with 408,000 hand images in different backgrounds, comparing the performance in terms of accuracy and computational cost to two competitive segmentation methods existing in literature, namely Lossy Data Compression (LDC) and Normalized Cuts (NCuts). The results highlight that the proposed method outperforms current competitive segmentation methods with regard to computational cost, time performance, accuracy and memory usage.


Introduction
Hand biometrics is currently receiving increasing attention because of its wide applicability in daily scenarios and the relation between user acceptance and identification/verification rates [1,2].
The non-invasiveness and acceptability of this biometric technique make it a suitable method for verification and identification on devices such as PCs or mobile phones, since the requirements of a hand biometrics system are easily met with a standard camera and processor.
However, as applications requiring hand biometrics move towards contact-less, platform-free scenarios (e.g., smartphones [3]), hand acquisition (capture and segmentation) becomes more difficult. In other words, hand biometrics is evolving from constrained and contact-based scenarios [4,5] to the opposite approach, where less collaboration is required from individuals [3,6]; this makes the technique less invasive and thus improves its acceptability.
Consequently, image pre-processing becomes compulsory to tackle this problem: an accurate segmentation algorithm is needed to isolate the hand from the background, whatever its nature and regardless of environmental and illumination conditions. This paper proposes such a segmentation method.
The proposed approach is based on multiscale aggregation, gathering pixels along scales according to a given similarity Gaussian function. This method produces an iterative clustering aggregation, providing a solution for hand image segmentation with a quasi-linear computational cost and an adequate accuracy for biometric applications.
The method has been tested with a synthetic image database, with around 408,000 images considering different backgrounds (e.g., soil, skins/fur, carpets, walls or grass) and illumination environments, and compared to two competitive approaches in literature in terms of image segmentation. These approaches are named Lossy Data Compression (LDC) [7] and Normalized Cuts (NCut) [8].
Finally, the paper is organized as follows: Section 2 provides an overview of the current literature, and Section 3 describes the proposed method. The database involved in the evaluation is presented in Section 4 and the results in Section 5, with conclusions and future work in Section 6.
In fact, the overall performance in terms of identification accuracy relies strongly on the result provided by the segmentation and pre-processing procedure.
Concerning hand-based biometrics, segmentation received little attention in early works, given that initial approaches carried out the acquisition procedure against a constrained and homogeneous background [4,18]. This background was selected so that hand segmentation became a trivial task solved by simple thresholding.
However, as hand biometrics evolves from contact and peg-based approaches to completely contact-less, peg-free and platform-independent scenarios, hand segmentation is becoming more difficult [6,19,20].
Several approaches in the literature tackle this problem by providing non-contact, platform-free scenarios but with a constrained background, usually a monochromatic color that is easily distinguished from hand texture by means of simple image thresholding [21][22][23]. More realistic environments call for color-based segmentation, detecting hand-like pixels by means of probabilistic methods [24], clustering methods [25] or edge detection [4,5,20].
The most common applications of this approach consider image segmentation and boundary detection based on texture [29,31], providing accurate results when compared to human segmentation and other competitive approaches in literature [32].
The results obtained by multiscale aggregation in the fields of unsupervised image segmentation are certainly promising [32], and the application of this method for hand segmentation has been recently proposed [3].
Nonetheless, several aspects must be improved in terms of computational cost and memory usage efficiency [3,30,32]. In fact, these methods are strongly dependent on the number of pixels in an image, and only small images are supported. This limitation was partially solved [3,30], providing a quasi-linear segmentation method, described in detail in the following section.

Gaussian Multiscale Aggregation
The proposed approach attempts to provide an accurate segmentation of a colour hand image. The strategy of the algorithm consists of aggregating similar nodes according to a specific criterion along different scales until a given goal is met, ensuring that the nodes aggregated within each segment satisfy certain properties.
The first step of the algorithm consists of giving a particular structure to the set of elements within the image. As in other methods [30], the proposed algorithm assumes that a given image $I$ can be represented by a graph $G = (V, E)$, where the nodes in $V$ represent pixels in the image and the edges in $E$ stand for the structure provided to the set of nodes.
In this approach, the structure on the first scale is assumed to be a 4-neighbourhood strategy, while for subsequent scales, structure is provided by means of Delaunay triangulation [33].
In addition, each node is represented by a similarity function denoted by $\phi^{[s]}_{v_i}$, where $v_i \in V$ designates a node in graph $G$ and $s$ indicates the scale the element $v_i$ belongs to. This similarity function is described in terms of relative measures with respect to the intensity average and standard deviation.
In more detail, $\phi^{[s]}_{v_i}$ is represented by a Gaussian distribution $N(\mu, \sigma)$, where $\mu$ and $\sigma$ specify the average and standard deviation of the neighbour intensities, given the 4-neighbourhood structure.
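A minimal Python sketch of this first-scale similarity function follows. It assumes a greyscale image stored as a list of rows, uses the population standard deviation of the 4-neighbours (excluding the pixel itself), and floors σ at a small hypothetical `eps` so that flat regions do not produce a degenerate Gaussian. None of these choices is specified in the text, and the function names are illustrative only.

```python
import math

def neighbour_gaussian(image, row, col, eps=1e-6):
    """Return (mu, sigma) of the 4-neighbour intensities of pixel (row, col).

    Assumes the image has at least two pixels, so every pixel has a neighbour.
    """
    height, width = len(image), len(image[0])
    values = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < height and 0 <= c < width:  # border pixels have fewer neighbours
            values.append(image[r][c])
    mu = sum(values) / len(values)
    variance = sum((v - mu) ** 2 for v in values) / len(values)
    # eps is an assumed floor: it keeps sigma strictly positive in flat regions
    return mu, max(math.sqrt(variance), eps)

def gaussian_pdf(x, mu, sigma):
    """Evaluate the Gaussian density N(mu, sigma) at intensity x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
```

For a 3 × 3 image with intensities 1 to 9 in row order, the centre pixel has neighbours 2, 4, 6 and 8, which gives µ = 5 and σ = √5.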
Thus, similarity functions lead to the concept of likelihood between the nodes joined by an edge, providing a definition of weights within graph $G$.
Given a graph $G = (V, E)$, the similarity between pairs of nodes is provided by means of weights $W$, defined for each scale $s$ by Equation (1), which integrates the product of the similarity functions $\phi^{[s]}_{v_i}$ and $\phi^{[s]}_{v_j}$ of nodes $v_i$ and $v_j$, respectively, with $v_i, v_j \in V$, $\forall i, j$. In addition, $\alpha$ stands for the selected colour space, which in this paper corresponds to the a* layer of the CIELAB (CIE 1976 L*, a*, b*) colour space, chosen for its ability to describe all colours visible to the human eye [9].
Therefore, graph $G = (V, E, W)$ contains not only structural information on a given scale $s$ but also relational details about the similarity of each node neighbourhood.
Furthermore, $W_{i,j}$ can be regarded as the weight associated with edge $e_{i,j}$, so that $W_{i,j} = W(e_{i,j})$. Notice that weights are not defined for every pair of nodes in $V$, but only for those pairs of nodes joined by an edge in $E$.
Some properties follow from the definition of $W_{i,j} \in W$ as the similarity between two nodes $v_i$ and $v_j$. For all $i, j$: (1) $W_{i,j} \geq 0$; (2) $W_{i,j} = W_{j,i}$; (3) $W_{i,j}$ attains its maximum value if and only if nodes $v_i$ and $v_j$ have the same similarity distribution. Property (1) results from the definition given by Equation (1), since integrating the product of two non-negative functions gives a non-negative result. Similarly, property (2) follows from the commutativity of the product of two functions.
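As an illustration, and assuming (as the discussion of property (1) suggests) that the weight is the integral of the product of two Gaussian similarity functions taken over the whole real line, the weight has a well-known closed form: the integral of $N(\mu_i, \sigma_i) \cdot N(\mu_j, \sigma_j)$ equals a Gaussian of variance $\sigma_i^2 + \sigma_j^2$ evaluated at $\mu_i - \mu_j$. The sketch below uses that closed form; the function names and the integration domain are assumptions, not the authors' implementation.

```python
import math

def gaussian_overlap(mu_i, sigma_i, mu_j, sigma_j):
    """Closed-form integral over the real line of the product of two Gaussian densities."""
    variance = sigma_i ** 2 + sigma_j ** 2
    return math.exp(-(mu_i - mu_j) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def edge_weights(params, edges):
    """Map each edge (i, j) to its weight, given params[n] = (mu_n, sigma_n)."""
    return {(i, j): gaussian_overlap(*params[i], *params[j]) for i, j in edges}
```

Under this assumed form, non-negativity and symmetry hold by construction, and for fixed dispersions the value is largest when the two means coincide.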
These properties hold for each scale $s$, although for the sake of simplicity this index was omitted from the previous notation. Furthermore, each node $v_i \in V$ also stores its location within the image, which will be useful in later scale aggregation steps.
On the other hand, the essence of this algorithm relies on aggregation, which consists of grouping and clustering those similar nodes/segments in subgraphs, according to some criteria along scales.
The proposed method bases the aggregation procedure on the weights in $W$: pairs of nodes/subgraphs with higher weights are more similar than those with lower weights, and therefore deserve to be aggregated under the same segment/subgraph. Thus, a function must be defined to impose an order on the set $W$, so that the subgraphs in subsequent scales contain nodes with high weights and, therefore, high similarity.
Let Ω be an ordering function, which orders the edges in $E$ by decreasing weight according to $W$ (Equation (2)). In other words, let $e = \{e_1, \ldots, e_m\}$ be a set of edges. If Ω is applied to $e$, then $W(\Omega(e)_i) \geq W(\Omega(e)_j)$ for all $i \leq j$, where $\Omega(e)_i$ is the $i$th element of the ordered set $\Omega(e)$. Concretely, $\Omega_W$ represents the weight set $W$ after Ω is applied.
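The ordering function amounts to sorting the edges by decreasing weight. A minimal sketch, assuming the weights are stored in a dictionary keyed by edge (the name `order_edges` is illustrative):

```python
def order_edges(weights):
    """Return the edges sorted by decreasing weight (ties keep their insertion order)."""
    return sorted(weights, key=weights.get, reverse=True)
```

For example, weights of 0.2, 0.9 and 0.5 on edges (0, 1), (1, 2) and (2, 3) yield the order (1, 2), (2, 3), (0, 1).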
Once the ordering function has been introduced, the algorithm aggregates pairs of nodes following this weight ordering, ensuring that the dispersion of each segment remains bounded. This aggregation criterion, denoted $\delta_{i,j}$, is given by Equation (3), where $\sigma_{i,j}$ represents the dispersion obtained by aggregating nodes $v_i$ and $v_j$. Although the geometric mean was selected as the comparison criterion in Equation (3), other choices are possible, such as the arithmetic, generalized or harmonic mean. The geometric mean was selected based on experimental results. Once pairs of nodes have been ordered and an aggregation criterion has been stated, the Gaussian Multiscale Aggregation (GMA) algorithm aggregates pairs of nodes according to Equation (4). In other words, the GMA approach aggregates a pair of nodes in scale $s$ under the same existing segment when at least one of them is already assigned to a segment. If neither has previously been assigned to any segment, a new segment is created. In all these cases, aggregation is carried out as long as $\delta_{i,j}$ holds; otherwise, different segments are assigned to the pair of nodes.
In addition, the number of segments assigned in scale $s$ is given by $p$, described in Equation (5), which depends on $\bar{\delta}_{i,j} = 1 - \delta_{i,j}$ and on the function $\xi_{i,j}$ defined in Equation (6). This assignment is done for each value in the ordered set $\Omega_W$, until either every element in $\Omega_W$ has been evaluated or every node in $V$ has been assigned a segment in the subsequent scale. Gaussian Multiscale Aggregation ensures that every node in scale $s-1$ is assigned a segment/subgraph in scale $s$.
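One pass of this assignment can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the aggregation criterion is passed in as a hypothetical predicate `can_merge(i, j)` standing in for $\delta_{i,j}$, pairs whose two nodes are already assigned are skipped (the text does not say how they are handled), and any node left untouched by the edge list receives its own segment so that every node is assigned.

```python
def aggregate(num_nodes, ordered_edges, can_merge):
    """Assign each node a segment label for the next scale.

    ordered_edges: edges (i, j) sorted by decreasing weight.
    can_merge:     hypothetical predicate standing in for the criterion delta_{i,j}.
    Returns (labels, p), where p is the number of segments created.
    """
    labels = [None] * num_nodes
    next_label = 0
    unassigned = num_nodes
    for i, j in ordered_edges:
        if unassigned == 0:
            break  # every node already has a segment
        if labels[i] is not None and labels[j] is not None:
            continue  # assumption: existing segments are never merged in this pass
        if can_merge(i, j):
            # reuse the existing segment of either node, or open a new one
            label = labels[i] if labels[i] is not None else labels[j]
            if label is None:
                label = next_label
                next_label += 1
            for node in (i, j):
                if labels[node] is None:
                    labels[node] = label
                    unassigned -= 1
        else:
            # criterion fails: each still-unassigned node gets a segment of its own
            for node in (i, j):
                if labels[node] is None:
                    labels[node] = next_label
                    next_label += 1
                    unassigned -= 1
    for node in range(num_nodes):  # nodes not covered by any edge
        if labels[node] is None:
            labels[node] = next_label
            next_label += 1
    return labels, next_label
```

With four nodes of intensities 10, 11, 50 and 52, a chain of edges (0, 1), (1, 2), (2, 3), and a toy predicate that merges nodes whose intensities differ by less than 5, the pass returns two segments: nodes 0 and 1 in one, nodes 2 and 3 in the other.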
After aggregation, nodes in scale $s$ are gathered into $p$ subgraphs, with $p < N^{[s]}$, where $N^{[s]}$ is the number of nodes in scale $s$. Each subgraph contains a set of nodes, whose number is unknown a priori. These subgraphs must be compared in subsequent scales, and thus the similarity function of a subgraph is defined in Equation (7).
Notice that the definition of the similarity function $\phi_{G^{[s+1]}_n}$ also makes sense for individual nodes in $V$, considering nodes as graphs of one element. This is essential during the aggregation in the first scale, where graphs gather nodes instead of subgraphs. In this case, function $\phi^{[0]}$ is a Gaussian function whose mean and deviation correspond to the average and dispersion of the intensity of the neighbour nodes, as stated before.
Therefore, similarity functions can be completely defined as in Equation (8), where $N(\mu, \sigma)$ stands for the Gaussian distribution with average $\mu$ and deviation $\sigma$, both corresponding to their respective neighbour properties. For the sake of clarity, the first scale ($s = 0$) is obtained from the nodes $v \in V$, and subsequent scales are obtained by gathering subgraphs. Concerning location, the position of a subgraph is obtained by averaging the positions of the nodes it contains. This is essential in order to provide a neighbourhood structure, since after aggregation every scale $s$ consists of a scattered set of subgraphs. This structure is given by means of Delaunay triangulation.
A Delaunay graph for a set $S = \{p_1, \ldots, p_n\}$ of points in the plane has the set $S$ as its vertices. Two vertices $p_i$ and $p_j$ are joined by a straight line (representing an edge) if and only if the Voronoi regions $V(p_i)$ and $V(p_j)$ share an edge. In addition, for a set of points in $\mathbb{R}^2$, knowing the locations of the endpoints permits a solution in $O(n \log n)$ time. Therefore, Delaunay triangulation is a suitable method to provide a neighbourhood structure to the previously aggregated subgraphs.
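To make the neighbourhood structure concrete, the sketch below builds Delaunay edges with the empty-circumcircle property: a triangle belongs to the triangulation when no other point lies strictly inside its circumcircle. This brute-force version is only illustrative (it is far slower than the $O(n \log n)$ algorithms mentioned above, and it assumes points in general position); a practical implementation would rely on a library routine such as `scipy.spatial.Delaunay`. The function names are hypothetical.

```python
from itertools import combinations

def circumcircle(a, b, c):
    """Return (centre_x, centre_y, radius_squared), or None if a, b, c are collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2

def delaunay_edges(points, tol=1e-9):
    """Return the set of index pairs (i, j), i < j, joined by a Delaunay edge."""
    edges = set()
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue  # degenerate (collinear) triple
        ux, uy, r2 = circle
        # keep the triangle only if no other point is strictly inside its circumcircle
        empty = all(
            (px - ux) ** 2 + (py - uy) ** 2 >= r2 - tol
            for n, (px, py) in enumerate(points)
            if n not in (i, j, k)
        )
        if empty:
            edges.update({(i, j), (i, k), (j, k)})
    return edges
```

For the four corners of a square plus its centre, the result contains the four sides and the four spokes to the centre, but neither diagonal.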
This operation represents the final step in the loop: at this point, each new subgraph represents a node, the edges $E^{[s+1]}$ are provided by Delaunay triangulation, and the weights $W^{[s+1]}$ are obtained from Equations (1) and (8).
The whole loop is repeated until only two subgraphs remain, as stated at the beginning of this section. However, due to the constraint imposed on aggregation (Equation (3)), the method may be unable to aggregate any more segments before reaching the goal of dividing the image into two subgraphs. Therefore, Equation (3) is relaxed in practice and restated as Equation (9), where $k^{[s]}$ is a factor that prevents the aggregation method from getting stuck in the loop. This factor can be dynamically increased or decreased according to the needs of the method; however, it is initially set to $k^{[s]} = 0.01$ for each scale $s$. Adapting $k^{[s]}$ to the needs of the algorithm remains as future work.
The computational cost of this algorithm is quasi-linear in the number of pixels, since each scale reduces the number of nodes by, in practice, a factor of three (Figure 4). Therefore, processing the first scale (which contains the highest number of nodes) takes longer than processing any subsequent scale, and the total time is approximately comparable to twice the time needed to aggregate the first scale. This statement will be supported by the results of Section 5.
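The cost argument can be made explicit with a geometric series. The derivation below is a sketch under two assumptions not stated in the text: the node count shrinks by exactly a factor of three per scale, and the cost of processing a scale is proportional to its number of nodes, with a hypothetical constant $c$ and total time $T$.

```latex
% Assumed: N^{[s]} \approx N^{[0]} / 3^{s} and a per-scale cost of c\,N^{[s]}
T \;\approx\; \sum_{s \ge 0} c\,N^{[s]}
  \;=\; c\,N^{[0]} \sum_{s \ge 0} 3^{-s}
  \;<\; c\,N^{[0]} \cdot \frac{1}{1 - \tfrac{1}{3}}
  \;=\; \tfrac{3}{2}\, c\,N^{[0]}
```

Under these assumptions the total stays within a constant multiple of the first-scale cost, which is consistent with the quasi-linear behaviour and with the bound of roughly twice the first-scale time; the logarithmic factor of the Delaunay step and per-scale overheads, ignored here, could account for the difference between 3/2 and 2.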

Database
This section describes the creation of the synthetic database involved in the evaluation, which contains a total of 408,000 images of hands over a wide range of possible backgrounds, such as carpets, fabric, glass, grass, mud, different objects, paper, parquet, pavement, plastic, skin and fur, sky, soil, stones, tiles, tree, wall and wood.
The main aim of this database is twofold:
• First, the main purpose is to provide a comparative evaluation frame for segmentation algorithms, where existing approaches in the literature can be compared. In other words, this database makes it possible to assess to what extent a segmentation algorithm can satisfactorily isolate the hand from the background in real scenarios.
• In addition, this database contains the ground-truth result for each image, providing a possible supervised evaluation criterion. These ground-truth images could be obtained because the hands were captured against a blue-coloured background, so that the hand can be easily extracted by simple thresholding [22].
The creation of the synthetic database (named GB2S Database) considers the hands extracted in former database and the set of the aforementioned different textures, which were obtained from the website http://mayang.com/textures/.
First of all, a straightforward threshold-based segmentation was carried out [22], obtaining two binary masks: $M_h$, corresponding to those pixels representing the hand, and $M_b$, with the pixels corresponding to the background.
Afterwards, both masks are overlaid, with $M_b$ containing the pixels of a specific texture, resulting in an image with the hand over the desired background (grass, water, wood and so forth).
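The mask overlay can be sketched as a per-pixel selection. The example assumes images stored as lists of rows of RGB tuples and a binary hand mask of the same size, with the background mask taken as its complement; the function name is illustrative, and the illumination equalization and boundary smoothing steps are not included.

```python
def composite(hand_image, texture, hand_mask):
    """Place hand pixels (mask value 1) over texture pixels (mask value 0)."""
    return [
        [hand_px if m else tex_px for hand_px, tex_px, m in zip(hand_row, tex_row, mask_row)]
        for hand_row, tex_row, mask_row in zip(hand_image, texture, hand_mask)
    ]
```

Each output pixel comes from the hand image where the hand mask is set and from the texture elsewhere.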
In order to ensure there is no considerable difference in illumination between hand and background, each image is converted from the RGB to the YCbCr color space [9], a histogram equalization is carried out on the illumination channel (Y), and the inverse transform from YCbCr to RGB is then performed. Finally, a morphological opening with a small disk-shaped structuring element (radius of 5 pixels) is applied to fade the boundary between hand and background, so that the hand is integrated within the background.
All these former operations attempt to ensure a fair scenario, simulating the conditions provided in real situations.
For each hand image, a total of 5 × 17 (five images and 17 textures) synthetic images are created, collecting a total of 120 × 2 × 20 × 5 × 17 = 408,000 images (120 individuals, two hands, 20 acquisitions per hand, five images and 17 textures) to properly evaluate segmentation on real scenarios. Some visual examples of this database are provided in Figure 2.
This database is publicly available at http://www.gb2s.es/. Once the database has been presented, the following section presents the evaluation of the algorithm and the results obtained.

Results and Discussion
This section contains the results of the comparative evaluation of the proposed approach against LDC [7] and NCut [8]. First, the evaluation criterion is stated in order to provide a comparative frame; the results obtained in the evaluation follow.

Evaluation Criteria
Although there exist some unsupervised evaluation methods for image segmentation [32,34,35], we have preferred a supervised evaluation, since the synthetic database GB2S contains the ground-truth associated with each image. Segmentation results will be compared to this ground-truth image.
The proposed evaluation method is based on the F-measure [36], defined as follows:
$$F = \frac{2PR}{P + R}$$
where P (Precision or Confidence) stands for the number of true positives (true segmentation, i.e., classifying a hand pixel as hand) in relation to the number of true positives and false positives (false hand segmentation, i.e., considering background as hand), and R (Recall or Sensitivity) represents the number of true positives in relation to the number of true positives and false negatives (false background segmentation, i.e., considering hand as background). The F-measure lies within the [0, 1] interval, where 0 indicates a bad segmentation and 1 represents the best segmentation result. Aiming at a fair comparison, the proposed algorithm is compared to two competitive segmentation methods existing in the literature, namely Lossy Data Compression (LDC) [7] and Normalized Cuts (NCut) [8].
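A minimal sketch of this supervised evaluation follows, assuming the balanced F-measure $F = 2PR/(P+R)$ and binary masks stored as lists of rows in which 1 marks a hand pixel. The function name and the convention of returning 0 when there are no true positives are illustrative choices.

```python
def f_measure(predicted, truth):
    """Balanced F-measure between a predicted binary mask and its ground-truth."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(predicted, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # hand pixel correctly labelled as hand
            elif p:
                fp += 1      # background labelled as hand
            elif t:
                fn += 1      # hand labelled as background
    if tp == 0:
        return 0.0           # assumed convention: no overlap means the worst score
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For a prediction with two true positives, one false positive and one false negative, precision and recall are both 2/3, and so is the F-measure.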

Gaussian Multiscale Aggregation Evaluation
The evaluation of a segmentation method involves different aspects concerning accuracy, computational cost and parameters dependency.
The first aspect concerns the extent to which the algorithm is able to properly detect or isolate a specific object within an image. Concretely, in this paper, accuracy is understood as the capability of the proposed algorithm to properly isolate the hand from the background. Table 1 shows the results in terms of F-measure of the proposed method in comparison to the LDC approach and Normalized Cuts. Although the results obtained by the proposed method (first column) can be improved, they outperform those of the other two schemes. The reader may notice that scenarios with textures similar to the hand (e.g., soil) decrease the performance of the segmentation algorithms, but the proposed method still provides F-measure rates of more than 88%.
In addition, accuracy can also be evaluated visually. Figure 3 presents a comparative frame for segmentation evaluation, comparing the results obtained for the LDC method, Normalized Cuts and the proposed method. The reader can compare the obtained results (columns 4-6) to the ground-truth (column 2). The results obtained by the proposed approach preserve the shape of the hand more precisely, even in scenarios with similar textures like parquet (row 5) or wood (last row).
Figure 3. Columns, from left to right: original image, ground-truth, synthetic image, and the segmentation results of the proposed method, the LDC approach [7] and Normalized Cuts [8], respectively.
Secondly, concerning computational cost, Table 2 presents the segmentation time in relation to the number of pixels of the images. This temporal evaluation was carried out on a PC with a 2.4 GHz Intel Core 2 Duo processor and 4 GB of 1,067 MHz DDR3 memory, with the proposed method completely implemented in MATLAB. The results provided in Table 2 show that the proposed algorithm is faster than the compared approaches. In addition, the proposed method can segment larger images, whereas LDC and NCut cannot handle larger images without running out of memory.
Finally, this section will study the dependency of two parameters strongly related to algorithm performance, namely k factor and aggregation linearity (Equations (9) and (10)).
Factor k controls the aggregation capability of the overall method. In these experiments, k was experimentally fixed to k = 0.01, ensuring that the number of segments in the last scale is two: hand and background. However, extending the proposed approach to other image processing applications would require a dynamic factor k, adjusted depending on whether the algorithm stalls at a certain scale. The proposal of a dynamic factor k remains as future work. Figure 4 presents the relation between the number of segments along scales for different values of k. Notice that k = 0 implies no stopping criterion, so scales are aggregated until only one segment is obtained.
Figure 4. Dependency of the aggregation process on parameter k. The lower k, the lower the constraints to aggregate segments. Notice that k = 0 means no stopping condition.
During the explanation of the method, the algorithm was said to be quasi-linear in the number of pixels. This statement is supported by Table 2, but for the sake of clarity we provide a chart (Figure 5) indicating the proportion of time required by each scale. The most demanding scale is the first one, whose proportion is higher than that of the other scales, which supports the conclusion that the algorithm has a quasi-linear behaviour in relation to the number of pixels.

Conclusions and Future Work
The application of hand biometrics to unconstrained and contact-less, platform-free environments implies an increase in difficulty in the pre-processing and segmentation procedure in hand acquisition. Therefore, an unsupervised segmentation algorithm has been proposed based on Gaussian multiscale aggregation. This method gathers iteratively those pixels similar in texture and color under segments, until a certain number of clusters/segments is provided as a result.
This method is able to isolate hand from a wide range of backgrounds (carpets, fabric, glass, grass, mud, different objects, paper, parquet, pavement, plastic, skin and fur, sky, soil, stones, tiles, tree, wall and wood), simulating real situations and unconstrained background scenarios.
Besides, the evaluation of the proposed approach has been carried out based on a publicly available synthetic database, containing 408,000 hand image acquisitions with different background textures. The evaluation consisted of a comparison of the performance in terms of accuracy and computational cost to two competitive segmentation methods existing in literature, namely Lossy Data Compression (LDC) [7] and Normalized Cuts (NCuts) [8].
The results obtained indicate that the proposed algorithm outperforms existing segmentation algorithms in the literature, regarding not only accuracy and computational cost but also memory usage, since the proposed algorithm is quasi-linear in relation to the number of pixels.
As future work, we plan to implement the method with a dynamic k parameter, so that the algorithm can be adapted to any image, providing segmentations of more complex images. In addition, we aim at a faster implementation of the method, considering both software and hardware optimization, together with a more complete evaluation on other publicly available databases.