Texture Segmentation: An Objective Comparison between Five Traditional Algorithms and a Deep-Learning U-Net Architecture

Abstract: This paper compares a series of traditional and deep-learning methodologies for the segmentation of textures. Six well-known texture composites first published by Randen and Husøy were used to compare traditional segmentation techniques (co-occurrence, filtering, local binary patterns, watershed, multiresolution sub-band filtering) against a deep-learning approach based on the U-Net architecture. For the latter, the effects of the depth of the network, the number of epochs and the choice of optimisation algorithm were investigated. Overall, the best results were provided by the deep-learning approach. However, the best results were spread across the parameter combinations, and many configurations produced results considerably worse than the traditional techniques.

In recent years, advances in artificial intelligence have revolutionised image processing tasks. Several deep learning approaches [41][42][43] have achieved outstanding results in difficult tasks such as those of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [44]. Convolutional Neural Networks (CNNs) are well suited to analyse textures as their repetitive patterns can be learned and identified by filter banks [45]. The U-Net architecture proposed by Ronneberger et al. [46] has become a very widely used tool for segmentation and analysis, reaching thousands of citations in the few years since it was published. U-Nets have been used widely, for instance, for road extraction [47], singing voice separation [48], automatic brain tumour detection and segmentation [49] and cell counting, detection, and morphometry [50]. The success of these deep-learning approaches in very different areas invites their application to texture analysis.
In this work, a U-Net architecture for the segmentation of textures is implemented and objectively compared against several popular traditional segmentation strategies. The traditional algorithms (co-occurrence matrices [5], watershed [51], local binary patterns (LBP) [52,53], filtering [54] and multiresolution sub-band filtering (MSBF) [8]) were selected because results for them have been previously published on the texture composites proposed by Randen [55], and thus an objective numerical comparison is possible.
To perform an objective comparison, six well-known texture composites from the Brodatz [56] album, first published by Randen and Husøy [54], are segmented with U-Nets of different configurations and parameters and the results compared against previously published results. The effects of the configuration of the networks, namely, number of epochs, depth of the network in the number of layers, and type of optimisation algorithm are assessed. All the programming was performed in Matlab® (The MathWorks™, Natick, MA, USA) and the code is freely available through GitHub (https://github.com/reyesaldasoro/Texture-Segmentation).

Texture Composite Images
Six composite texture images were segmented in this work (Figure 1). The first five composites are images of 256 × 256 pixels and consist of five different textures, whilst the last one is 512 × 512 pixels and is formed with 16 different textures. The masks with which these were formed are shown in Figure 2. It should be highlighted that these textures were histogram equalised prior to the arrangement, and thus they cannot be distinguished by the general intensity of each region. Comparisons are frequently made over textures that have not been equalised (e.g., [57] Figure 3, [45] Figure 2), in which case the segmentation relies not only on the texture but also on the average intensity of the regions. Furthermore, whilst some textures are easy to distinguish, some are quite challenging, for instance, the difference between the central and bottom regions in Figure 1c.

Training Data
The training data in [54] is provided separately and is shown in Figure 3 for the first five composites and in Figure 4 for the last case. For the purpose of training the U-Nets, the training images were tessellated into sub-regions of 32 × 32 pixels each.

Pairs of textures and labels were constructed simultaneously in the following way: two training images were selected and sub-regions of each were extracted. For every pair of sub-regions, half of each was placed together so that a new 32 × 32 patch containing both textures was created, with a corresponding 32 × 32 patch containing the class labels. The patches were created with diagonal, vertical and horizontal pairs. The training images were traversed horizontally and vertically without overlap, creating numerous training pairs. A montage of the texture pairs and labels corresponding to Figure 1a is illustrated in Figure 5. All pairs between classes were considered, i.e., 1-2, 1-3, 1-4, 1-5, 2-1, 2-3, . . . , 5-3, 5-4. In total, 2940 patches were created for the five composites with five textures and 35,280 were created for the composite with sixteen textures.
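The tessellation-and-pairing scheme above can be sketched as follows. This is an illustrative reconstruction, not the paper's released Matlab code; the function names (`tessellate`, `join_vertical`) and the tiny constant "textures" are hypothetical, and only the vertical pairing direction is shown.

```python
def tessellate(image, patch=32):
    """Split a 2D list `image` into non-overlapping patch x patch blocks."""
    rows, cols = len(image), len(image[0])
    blocks = []
    for r in range(0, rows - patch + 1, patch):
        for c in range(0, cols - patch + 1, patch):
            blocks.append([row[c:c + patch] for row in image[r:r + patch]])
    return blocks

def join_vertical(block_a, block_b, class_a, class_b):
    """Top half from texture A, bottom half from texture B, plus a label patch."""
    half = len(block_a) // 2
    patch = block_a[:half] + block_b[half:]
    labels = [[class_a] * len(block_a[0]) for _ in range(half)] + \
             [[class_b] * len(block_b[0]) for _ in range(len(block_b) - half)]
    return patch, labels

# Tiny demonstration: two constant images stand in for two training textures
img_a = [[1] * 64 for _ in range(64)]
img_b = [[2] * 64 for _ in range(64)]
blocks_a = tessellate(img_a)   # 64x64 image -> four 32x32 blocks
blocks_b = tessellate(img_b)
patch, labels = join_vertical(blocks_a[0], blocks_b[0], class_a=1, class_b=2)
```

Repeating this for every ordered class pair and for the diagonal and horizontal splits yields the full training set.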
The traditional algorithms have been thoroughly described in the literature; however, for completeness, a short explanation of how features are extracted with each algorithm will follow. For a discussion of traditional texture techniques, the reader is referred to any of the following reviews [58][59][60].
Co-occurrence matrices are constructed from a quantised version of a grey-level image, so that if an image is quantised to 8 levels, the co-occurrence matrix will have 8 rows and columns. The value at each location of the matrix depends on the number of times that a pair of grey levels jointly occur at a neighbouring distance (e.g., 1 pixel away) with a certain orientation (e.g., horizontally). In this way, a co-occurrence matrix is able to measure local grey-level dependence: textural coarseness and directionality. For example, in coarse textures, the grey level of the pixels changes slightly with distance, while for fine textures the levels change rapidly. From this matrix, different features like entropy, uniformity, maximum probability, contrast, difference moment, inverse difference moment and correlation can be calculated [5]. Once the features have been calculated, classifiers can be applied directly, or further processing like the watershed transform can be applied.
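As a minimal sketch of the idea (illustrative helper names, not the implementation of [5]): the matrix below counts grey-level pairs one pixel apart horizontally on an image already quantised to a few levels, and two of the features named above are then computed from the normalised matrix.

```python
import math

def cooccurrence(image, levels):
    """Counts of horizontal grey-level pairs at distance 1 in a quantised image."""
    m = [[0] * levels for _ in range(levels)]
    for row in image:
        for a, b in zip(row, row[1:]):   # neighbouring pixels, 1 pixel away
            m[a][b] += 1
    return m

def features(m):
    """Uniformity (energy) and entropy of the normalised co-occurrence matrix."""
    total = sum(sum(r) for r in m)
    p = [[v / total for v in r] for r in m]
    uniformity = sum(v * v for r in p for v in r)
    entropy = -sum(v * math.log2(v) for r in p for v in r if v > 0)
    return uniformity, entropy

img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [2, 2, 3, 3],
       [2, 2, 3, 3]]
m = cooccurrence(img, 4)
u, e = features(m)
```

Other orientations (vertical, diagonal) are obtained by changing which neighbour is paired with each pixel.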
Watershed transforms are based on a topographical analogy of a landscape. Should water fall in this landscape, it would find the path through which it could reach a region of minimum altitude, i.e., a basin, sometimes called lake or sea. For each point in the landscape (or pixel of the image) there is a path towards one and only one basin. Thus, the landscape can be partitioned into catchment basins or regions of influence of the regional minima and the boundaries between the basins (e.g., points of inflection) are called the watershed lines. [61]. The watershed transform can be applied to features extracted from the co-occurrence matrix [51]. The basins produced can further be iteratively merged to segment textured regions.
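The topographical analogy can be illustrated with a toy steepest-descent sketch (this is an illustration of the basin concept only, not the algorithm of [51] or [61], and the function name is hypothetical): each pixel follows its lowest 4-connected neighbour until it reaches a regional minimum, and pixels that reach the same minimum share a catchment basin.

```python
def watershed_basins(height):
    """Label each pixel of a 2D height map with the basin it descends into."""
    rows, cols = len(height), len(height[0])

    def descend(r, c):
        while True:
            best = (height[r][c], r, c)
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    best = min(best, (height[nr][nc], nr, nc))
            if (best[1], best[2]) == (r, c):   # local minimum reached
                return r, c
            r, c = best[1], best[2]

    minima, labels = {}, [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            m = descend(r, c)
            labels[r][c] = minima.setdefault(m, len(minima) + 1)
    return labels

# Two valleys (outer columns) separated by a ridge in the middle column
h = [[0, 1, 5, 1, 0],
     [0, 1, 5, 1, 0],
     [0, 1, 5, 1, 0]]
lab = watershed_basins(h)
```

In the texture setting, the "height" map is a feature image (e.g., derived from the co-occurrence matrix), and the resulting basins are then merged iteratively into textured regions.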
Local binary patterns (LBP) [52] explore the relations between neighbouring pixels. These methods concentrate on the relative intensity relations between the pixels in a small neighbourhood and not on their absolute intensity values or the spatial relationship of the whole data. The underlying assumption is that texture is not properly described by the Fourier spectrum and traditional frequency filters. The texture analysis is based on the relationship of the pixels in a 3 × 3 neighbourhood. A Texture Unit is first calculated by comparing the grey level of a central pixel with the grey levels of its neighbours, recording whether each neighbour is greater or lower than the central pixel. Two advantages of LBP are that there is no need to quantise images and that there is a certain immunity to low-frequency artefacts. In a more recent paper, Ojala [53] presented another variation of LBP based on the sign of the grey-level differences, under which LBP is a particular case of a new operator called p8. This operator is considered as a probability distribution of grey levels: where p(g0, g1) denotes the co-occurrence probabilities, p(g0, g1 − g0) is used as a joint distribution.
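The basic 3 × 3 comparison can be sketched as follows (an illustrative sketch with hypothetical names, not the code of [52,53]): each of the 8 neighbours contributes one bit, set when its grey level is greater than or equal to the centre, giving a code in 0–255 per interior pixel.

```python
# Clockwise neighbour offsets starting at the top-left corner
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(image, r, c):
    """LBP code of the pixel at (r, c): one bit per 3x3 neighbour."""
    centre = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if image[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code

img = [[9, 9, 9],
       [1, 5, 1],
       [1, 1, 1]]
code = lbp_code(img, 1, 1)   # only the top row exceeds the centre
```

Because only the sign of each difference is kept, the code is unchanged under monotonic grey-level transformations, which is the source of the immunity to low-frequency artefacts mentioned above.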
Filtering, in the context of image processing, consists of a process that modifies the pixel values. There are spatial filters, which are applied directly to the values of the images (e.g., averaging neighbouring pixels to blur an image), and filters which are applied after a transformation of the data has been performed. Thus, a filter in the frequency or Fourier domain is applied after the image has been converted through the Fourier transform. The filters in the Fourier domain are sometimes named after the frequencies that are allowed to pass through them: low-pass, band-pass and high-pass filters. Since textures can vary in their spectral distribution in the frequency domain, a set of sub-band filters can help in their discrimination. One common frequency filtering approach is that of Gabor multichannel filter banks [2,10,62-64].
The partitioning of the Fourier space can be achieved in different ways, Gabor being only one. A multiresolution approach, based on finite prolate spheroidal sequences is described in [8]. The Fourier space is divided into frequencies and orientations, which are further subdivided in a multiresolution approach. Each filter then produces a feature; different textures are captured by different filters. In addition, a feature selection strategy can improve the texture segmentation.
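To give a concrete flavour of one channel of such a bank, the sketch below builds a small Gabor-style kernel, a cosine wave of a chosen wavelength and orientation under a Gaussian envelope. This is an illustrative sketch of the Gabor case only, with hypothetical parameter choices; it is not the prolate spheroidal construction of [8].

```python
import math

def gabor_kernel(size, wavelength, theta, sigma):
    """A size x size even (cosine) Gabor-style kernel."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # rotate coordinates to the filter orientation
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

k = gabor_kernel(size=7, wavelength=4.0, theta=0.0, sigma=2.0)
```

Convolving the image with kernels at several wavelengths and orientations produces one feature image per channel; textures with energy in different parts of the Fourier space respond to different channels.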

U-Net Configuration
The basic U-Net architecture was formed with the following layers: Input, Convolutional, ReLu, Max Pooling, Transposed Convolutional, Convolutional, Softmax and Pixel Classification. Two further levels of depth were investigated by repeating the downsampling and upsampling blocks. The image input layer was configured for the 32 × 32 patches. The convolutional layers consisted of 64 filters of size 3 with padding of 1. The pooling size was 2 with a stride of 2. The transposed convolutional layers had a filter size of 4, a stride of 2 and cropping of 1. The numbers of epochs evaluated were 10, 20, 50 and 100. The following optimisation algorithms were analysed: stochastic gradient descent with momentum (sgdm), Adam [65] and Root Mean Square Propagation (RMSprop). One last investigation was performed by training the 20-layer network two separate times to investigate the variability of the process.
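The layer parameters above fit together so that the spatial size of a patch is preserved end to end: a 3 × 3 convolution with padding 1 keeps the size, the 2 × 2 pooling with stride 2 halves it, and the transposed convolution with filter 4, stride 2 and cropping 1 doubles it. A short bookkeeping sketch (the standard size formulas, with illustrative function names) confirms that a 32 × 32 input returns to 32 × 32 at the output:

```python
def conv(n, filt=3, pad=1, stride=1):
    """Output size of a convolution: size preserved for filt=3, pad=1."""
    return (n + 2 * pad - filt) // stride + 1

def maxpool(n, pool=2, stride=2):
    """Output size of max pooling: halves the size for pool=2, stride=2."""
    return (n - pool) // stride + 1

def conv_transpose(n, filt=4, stride=2, crop=1):
    """Output size of a transposed convolution: doubles the size here."""
    return stride * (n - 1) + filt - 2 * crop

n = 32
n = conv(n)            # 32 -> 32
n = maxpool(n)         # 32 -> 16
n = conv_transpose(n)  # 16 -> 32
n = conv(n)            # 32 -> 32
```

Each repeated downsampling/upsampling pair in the deeper configurations preserves this property, since every halving is matched by a doubling.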

Misclassification
For the purposes of assessing the algorithms, a pixel-based assessment is considered. Each pixel whose class is correctly determined by the segmentation algorithm is counted as correct; every pixel to which the algorithm assigns a different class is counted as incorrect. Notice that since there is no foreground/background distinction, but rather correct or incorrect, both True Positive (TP) and True Negative (TN) pixels are included as correct, and False Positive (FP) and False Negative (FN) pixels as incorrect. Thus, the misclassification in percentage, or classification error, is calculated as the number of incorrect pixels divided by the total number of pixels of the image, m = 100 × (FP + FN)/(TP + TN + FP + FN). The accuracy can be calculated as the complement, a = 100 × (TP + TN)/(TP + TN + FP + FN).
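The measure defined above reduces to a direct pixel count, as the short sketch below shows (illustrative function name; the example label images are invented for the demonstration):

```python
def misclassification(predicted, truth):
    """Percentage of pixels whose predicted class differs from the ground truth."""
    total = correct = 0
    for row_p, row_t in zip(predicted, truth):
        for p, t in zip(row_p, row_t):
            total += 1
            correct += (p == t)
    return 100.0 * (total - correct) / total   # accuracy is 100 minus this value

pred  = [[1, 1, 2, 2],
         [1, 1, 2, 2]]
truth = [[1, 1, 2, 1],
         [1, 2, 2, 2]]
m = misclassification(pred, truth)   # 2 of 8 pixels differ -> 25.0
```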

Results
For each image, the networks were trained with the 3 different optimisation algorithms, 3 layer configurations and 4 numbers of epochs, for a total of 36 different combinations. Thus, for the 6 composite images there were 216 results. The misclassification of each segmentation was measured against the ground truth as the percentage of pixels classified incorrectly. These results are summarised in Table 1.
The best results for each image were selected and compared against the traditional methodologies; these are shown in Table 2. The results are illustrated graphically in two ways: Figure 6 shows the segmented classes overlaid as different colours over the original textured images, and Figure 7 shows correctly segmented pixels in white and misclassified pixels in black.
Table 2. Comparative misclassification (%) results with co-occurrence [5], the best filtering result from Randen [54], p8 and LBP [53], watershed [51], multiresolution sub-band filtering (MSBF) [8] and U-Net [46]. (Bold is the best for each image).

Discussion
The U-Net algorithm provided interesting results, both in terms of the actual misclassification against the traditional algorithms and in terms of the variability between U-Net configurations. The segmentation results provided by the U-Nets were better in four of the six images. In some cases, the results were very close to the second-best option (a: 2.8/2.6, d: 7.3/7.1), and in two cases (e, f) traditional algorithms provided better results (e: 4.3/7.7, f: 17.0/17.5). The average over all six composites was best for the U-Nets; however, given that the difference with the second best is relatively small (0.75) and that traditional algorithms provided better results in two of the six cases, care should be taken when selecting algorithms. This is similar to the conclusion of Randen, who stated that "No single approach did perform best or very close to the best for all images" [55].
In terms of the U-Net configuration, there are several interesting observations. First, there was great variability in the results produced by the different configurations. It was surprising that the maximum misclassification was in some cases extremely high: 80% in the cases with 5 textures and 94% in the case with 16 textures, which is equivalent to assigning a single class to all pixels. Second, three of the best results were obtained with 100 epochs, two with 10 epochs and one with 50, which is counter-intuitive, as longer training times would be expected to provide better results. Third, three of the best results were provided by RMSprop optimisation, two by Adam and one by sgdm. Fourth, and perhaps most surprising, the results provided by the two 20-layer configurations were very different. In a few cases the results were equal (e.g., image c, sgdm, 10 epochs; image b, Adam, 10 epochs), but in others the variation was huge (e.g., image b, Adam, 50 epochs).
In terms of texture, it should be highlighted that not all textures are equally difficult: the five textures of image (a) are far easier to distinguish and correctly segment than those of images (b) and (f). The U-Net was capable of segmenting these textures with accuracy comparable to, or better than, the traditional techniques. As mentioned previously, the fact that the textures have been histogram equalised removes the possibility of discriminating the regions by their average intensities. More complex architectures, e.g., Siamese networks [57], could provide better results, but it is important to use a standard benchmark such as that provided by Randen [55].
There are many other configuration parameters that could be varied: learning rate, batch size, variations of the training data, different numbers of layers. For the purposes of this work, however, the results show, first, the capability of deep-learning architectures for the segmentation of textured images and, second, that in some cases they provide better results than traditional methodologies. However, the configuration of the network is not trivial, and variations of some parameters can produce sub-optimal results. The experiments conducted in this work did not provide conclusive evidence for the selection of any of the parameters evaluated. Furthermore, training the networks requires considerable resources: the training times were around 5 hours for the images with 5 textures and around 96 hours for the image with 16 textures on an Apple (Cupertino, CA, USA) Mac Pro (Late 2013) with a 3.7 GHz Quad-Core processor, 32 GB of memory and dual AMD FirePro D300 graphics processors.
Therefore, it can be concluded that U-Net convolutional neural networks can be used for texture segmentation and provide results that are comparable to or better than those of traditional texture algorithms. Furthermore, these results encourage the application of deep learning to other areas. If we assume that different textures are characterised by patterns, i.e., repetitions of certain sequences or particular variations of intensity, then any data characterised by patterns could be analysed in the same way. For instance, phonemes in human speech have different patterns, which when combined form words; thus, one line of an image with different textures has similar characteristics to the intensity variation of a phrase with different phonemes. Moreover, voice signals, which are one-dimensional, can be converted into two-dimensional spectrograms [66], with time on one axis and frequency on the other. In these cases, the spectrograms can be analysed as textures directly.

Conflicts of Interest:
The authors declare no conflict of interest.