Transfer Learning from Synthetic Data Applied to Soil – Root Segmentation in X-Ray Tomography Images

One of the most challenging computer vision problems in the plant sciences is the segmentation of roots and soil in X-ray tomography. So far, this has been addressed using classical image analysis methods. In this paper, we address this soil–root segmentation problem in X-ray tomography using a variant of supervised deep learning-based classification called transfer learning, where the learning stage is based on simulated data. The robustness of this technique, tested for the first time on this plant science problem, is established using soil–root images with very low contrast in X-ray tomography. We also demonstrate the possibility of efficiently segmenting the root from the soil while learning from purely synthetic soil and roots.


Introduction
Deep learning [1] is currently used worldwide in almost all domains of image analysis as an alternative to traditional purely handcrafted tools. Plant science, in this context, is a promising area [2][3][4] for the application of deep learning for various reasons. Firstly, it involves many biological variables, such as growth, response to biotic and abiotic stress, and physiology. Secondly, plants display huge variability (e.g., size, shape, color), and the consideration of all these variables surpasses the human capacity for software development in response to the needs of plant scientists. Thirdly, thanks to phenotyping centers or the use of robots in the field, the throughput of image acquisition is relatively high, so the observed large populations of plants can meet the big-data requirements of effective deep learning. While deep learning achieves excellent informational performance [5][6][7][8] in such regimes, at present it also comes with extremely high computational costs due to the long training periods executed on graphics cards. A striking fact when examining the first layers of deep neural networks is that these layers look almost like Gabor wavelets. While promoting a universal framework, these machines seem to systematically converge toward tools that humans have been studying for decades [9]. This empirical fact is exploited by computer scientists in so-called transfer learning [10], where the first layers of an already-trained network are re-used [11]. Another current limitation to the application of deep learning is that it requires huge training data sets to avoid over-fitting. Such training data sets have to be annotated when supervised learning is targeted, and at present only very few of them are available in the image analysis community with respect to plant science. One of the problems of deep learning with the convolutional neural network (CNN) is thus that the learning phase, where the network undergoes weight modification, can be very time-consuming and may need a very large set of images.
In this article we use a common trick called transfer learning [10,21] to circumvent this problem. The idea is to use an already-trained CNN to classify images. This CNN has been trained for a classification application which is not the one we want to perform. To understand why this approach to classification nevertheless works, let us recall that a CNN has two functionalities: (1) it has modeled features through training; and (2) it classifies images given as input using these features. Only the first functionality of the pre-trained CNN is used in our transfer learning approach. This training-modeling phase of the CNN is realized beforehand with an existing very large database of images, manually classified into a large array of possible classes (e.g., dog breeds, guitar types, professions, and plants). By being built on this large database, the CNN can be assumed to select a very good feature space because it is capable of sorting very diverse images into very diverse categories.
This assumption is somewhat grounded in the existence of common features (blobs, tubes, edges, ...) in images of natural scenes. The study of these common features in natural scenes is a well-established topic in computer vision, which has been investigated for instance in gray-level images [22,23], color images [24,25], and even in three-dimensional (3D) images [26]. It is thus likely that, since the images we want to classify share some common features with the images of the database used for training the CNN, the selected features will also operate efficiently on our images. Please note that we do not use the second functionality of the pre-trained CNN, because this CNN was trained on classes which are probably not the ones we are interested in. We simply feed our computed features to a classic classifier such as the support vector machine (SVM).

Implementation
In this section we describe how, from the concepts briefly recalled in the previous section, we designed an implementation capable of addressing soil-root segmentation in X-ray tomography images.

Application to Image Segmentation
Some recent techniques in deep learning enable a full classification of entire images [27]. In this article we consider the classification of each pixel of the image as belonging to root or soil. However, the machine learning techniques presented in the previous section do not take single scalars but images as input to realize a classification. Therefore, the idea is to classify a pixel from a small window, also called a patch, centered on this pixel. Each patch is a small part of the image, and its class ("soil"/"root") is the class of the central pixel around which it was generated. It is these patches which are given to the pre-trained CNN. Once the features are computed for each patch, these features are given to a classic classifier. Finally, there is of course a training phase, where this classifier receives labeled patches (coming from a segmented image), and then a testing phase, where the classifier predicts the patch class.
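The patch construction described above can be sketched as follows; `extract_patch` is a hypothetical helper, and the reflected-border handling is an illustrative choice rather than the paper's documented implementation:

```python
import numpy as np

def extract_patch(image, row, col, patch_size):
    """Return a square patch centered on pixel (row, col).

    The patch inherits the class ("soil"/"root") of its central pixel.
    Border pixels are handled by reflecting the image (an assumption;
    the paper does not specify its border policy).
    """
    half = patch_size // 2
    padded = np.pad(image, half, mode="reflect")
    # After padding, original pixel (row, col) sits at (row + half, col + half),
    # so this window is exactly centered on it.
    return padded[row:row + patch_size, col:col + patch_size]

# Example: a 15-pixel patch around pixel (10, 20) of a 266 x 266 slice.
slice_2d = np.random.default_rng(0).integers(0, 256, (266, 266)).astype(np.uint8)
patch = extract_patch(slice_2d, 10, 20, 15)
```

Reflection padding lets even border pixels receive a full-size patch, so every pixel of the image can be classified.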

Algorithm
Our algorithm, based on transfer learning and shown as Algorithm 1, goes through three basic steps, both for training and testing: creation of patches around the pixels, extraction of features from these patches with a pre-trained CNN, and the feeding of these features to an SVM. The training image is labeled (each pixel is labeled "part of the object" or "part of the background"), and these labels are fed to the SVM along with the computed features to train it. The trained SVM is then capable of predicting pixel labels from the testing image's features. As underlined in Algorithm 1, the parameters to be tuned or chosen by the user are the size of the patch and the size of the training data set.
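The three steps (patches, features, SVM) can be sketched in Python with scikit-learn; since the pre-trained CNN is not reproduced here, `cnn_features` is a stand-in descriptor (a simple intensity histogram), and all function names are hypothetical:

```python
import numpy as np
from sklearn.svm import LinearSVC

def cnn_features(patch):
    # Stand-in for the pre-trained CNN descriptor (1000 features in the paper);
    # a normalized intensity histogram keeps the sketch self-contained.
    hist, _ = np.histogram(patch, bins=32, range=(0, 256))
    return hist / max(hist.sum(), 1)

def train_classifier(patches, labels):
    # Training phase: compute features for labeled patches, then fit the SVM.
    X = np.array([cnn_features(p) for p in patches])
    clf = LinearSVC()
    clf.fit(X, labels)
    return clf

def predict_labels(clf, patches):
    # Testing phase: same features, then SVM prediction per patch.
    X = np.array([cnn_features(p) for p in patches])
    return clf.predict(X)

# Toy demonstration: bright "root" patches versus dark "soil" patches.
rng = np.random.default_rng(1)
root_patches = [np.clip(rng.normal(200, 10, (15, 15)), 0, 255) for _ in range(20)]
soil_patches = [np.clip(rng.normal(60, 10, (15, 15)), 0, 255) for _ in range(20)]
clf = train_classifier(root_patches + soil_patches, [1] * 20 + [0] * 20)
preds = predict_labels(clf, [root_patches[0], soil_patches[0]])
```

In the actual pipeline, `cnn_features` would be replaced by a forward pass of the pre-trained CNN on each patch.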

Material
The pre-trained CNN we used was developed by [28] and trained on the ImageNet database. We used this CNN because it is one of the most general ones (some CNNs were trained on more specific cases such as face recognition), and thus it can be expected to be more efficient for transfer learning in our problem. This network is composed of 22 layers and yields a total of 1000 features. The convolutional filters applied to the input image in the first layer of the CNN can be seen in Figure 1. These filters appear very similar to wavelets [29] oriented in all possible directions. This is likely to enhance blob-like or tube-like structures such as the tubular roots or grainy blobs of the soil found in our X-ray tomography. The classifier used was a linear SVM. It was chosen after comparing, by cross-validation on our data, the performance of all other types of classifiers available in Matlab. Computation was run on Matlab R2016A, on a machine with an Intel (R) Xeon (R) 3.5 GHz processor (Intel, Santa Clara, CA, USA), 32 GB RAM, and a FirePro W2100 AMD GPU (AMD, Santa Clara, CA, USA).
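The classifier selection by cross-validation mentioned above was done in Matlab; a scikit-learn equivalent can be sketched on synthetic stand-in features (the data below are not the paper's CNN descriptors):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class feature matrix standing in for the CNN descriptors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(2, 1, (50, 20))])
y = np.array([0] * 50 + [1] * 50)

# Compare candidate classifiers by 5-fold cross-validated accuracy.
results = {}
for name, clf in [("linear SVM", LinearSVC()),
                  ("random forest", RandomForestClassifier(n_estimators=50, random_state=0))]:
    results[name] = cross_val_score(clf, X, y, cv=5).mean()
```

On the real 1000-dimensional CNN features, the same loop would simply be extended with the other classifier families considered.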
All computer tomographic (CT) data were measured with an individually designed X-ray system at the Fraunhofer EZRT in Fürth, Germany, using a GE 225 MM2/HP source, an Aerotech axis system (Aerotech, Pittsburgh, PA, USA), and the Meomed XEye 2020 detector (MEOMED, Přerov, Czech Republic) operating with a binned rectangular pixel size of 100 µm. The source was operated at 175 kV acceleration voltage with a current of 4.7 mA. To further harden the spectrum, 1-mm-thick copper pre-filtering was applied. The focus-object distance was set to 725 mm and the focus-detector distance to 827 mm, resulting in a reconstructed voxel size of 88.9 µm. To mimic the data quality typically occurring in high-throughput measurement modes, only 800 projections with a 350-ms illumination time were recorded within the 360 degrees of rotation. This resulted in a measurement time of only 5 min for scanning the whole field of view of about 20 cm. The pot used for the measurement was a PVC (polyvinyl chloride) tube with a 9-cm diameter, and only a small partial volume in the middle part of the whole reconstructed volume was used to reduce the simulation time. The roots used as the reference in the experiment in Section 3 and as experimental data for the experiment in Section 4 were maize plants of the type B73. During the growth period, the plants were stored in a Conviron A1000PG growth chamber. The temperature was 21 °C within the 12-h light period and 18 °C during the night. Two different soils were used for the two experiments. In the experiment shown in Section 3, the soil was the commercially available Vulkasoil 0/0,14 obtained from VulaTec in Germany. In the experiment in Section 4, the soil was the agricultural soil used in [19]. Both soils were mainly mineral soils with a coarse particle size distribution. While the Vulkasoil in the experiment of Section 3 resulted in a very low contrast with the root system, the high amount of sand in the agricultural soil sample increased the contrast visibly.

Segmentation of Simulated Roots
In this section, we designed a numerical experiment dedicated to the segmentation of simulated root systems after learning from other simulated root systems. The learning and testing images were both generated the same way: three-dimensional (244 × 244 × 26 pixel) volumes of real soil imaged with X-ray tomography. The root structure was generated from the L-system simulator of [30] in the form of [31], presented in 3D in [32]. Simulated roots were added to the soil by replacing soil pixels with root pixels. The intensities of the roots and the soil were measured from a manual segmentation of real tomography images of maize in Vulkasoil. The estimated mean and standard deviation are given in Table 1. We simulated the roots with spatially independent and identically distributed Gaussian noise of fixed mean and standard deviation. As visible in Figure 2, learning and testing images have neither the same soil nor the same root structure. This experiment is interesting because the use of simulated roots enabled us to experiment with various levels of contrast between soil and root. Also, since the L-system used is a stochastic process, we had access to training and testing data sets of unlimited size. It is therefore possible with this simulation approach to investigate the sensitivity of the machine learning algorithm of the previous section to the choice of the parameters (size of the patch, size of the training data sets, ...).
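The root-insertion step described above (replacing soil pixels with i.i.d. Gaussian root intensities) can be sketched as follows; `insert_simulated_roots` is a hypothetical helper, and the mean and standard deviation would be the measured values of Table 1 (placeholders here):

```python
import numpy as np

def insert_simulated_roots(soil, root_mask, root_mu, root_sigma, seed=0):
    """Replace soil voxels under root_mask with simulated root intensities.

    Root gray levels are drawn i.i.d. from a Gaussian with the mean and
    standard deviation measured on manually segmented real tomography data.
    """
    rng = np.random.default_rng(seed)
    out = soil.astype(float).copy()
    values = rng.normal(root_mu, root_sigma, size=int(root_mask.sum()))
    out[root_mask] = np.clip(values, 0, 255)  # images are coded on 8 bits
    return out.astype(np.uint8)

# Example: a 1000-voxel cubic "root" in an empty soil volume (placeholder statistics).
soil = np.zeros((20, 20, 20), dtype=np.uint8)
mask = np.zeros((20, 20, 20), dtype=bool)
mask[5:15, 5:15, 5:15] = True
volume = insert_simulated_roots(soil, mask, root_mu=100, root_sigma=5)
```

In the experiments, `root_mask` would come from a rasterized L-system structure and `soil` from a real tomography volume.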

Nominal Conditions
As nominal conditions for the root/soil contrast, we considered the first-order statistics (mean and standard deviation) of the roots and soil given in Table 1, which corresponded to the contrast found in the real acquisition conditions of maize in Vulkasoil (as described in the Material section). As visible in Figure 2A,B, these conditions provide very low contrast. In the conditions of Table 1, with a combination of the information obtained with patches of 2 pixels and 15 pixels (see Section 4.2) and a training data set of 1000 patches, the segmentation performance obtained is given in Figure 3 with the confusion matrix in Table 2. To summarize the performance of the segmentation with a single scalar, we propose a quality measure QM obtained by multiplying sensitivity (the proportion of root pixels detected as such) and specificity (the proportion of detected root pixels that are truly root pixels): QM = [TP/(TP + FN)] × [TP/(TP + FP)], with TP being true positives, FP false positives, and FN false negatives. The quality measure QM is maximized at 1 for a perfect segmentation. For the segmentation of Figure 3, QM = 0.23. As visible in Table 2 and in Figure 3, the segmentation is not perfect, especially since false positives outnumber true positives. However, the quality of the segmentation cannot be fully captured by pixel-wise average metrics alone; the spatial positions of false positive pixels are also very important. As visible in Figure 3, false positives (in yellow) are gathered just around the true positives, and the small false positive clusters are much smaller than the roots. This means that we get a good idea of where the roots actually stand, and with very basic image processing techniques such as particle analysis and morphological erosion, one could easily obtain a much better segmentation result. Also, when the segmentation is applied to the whole 3D stack of images, it appears in Figure 3E,F that the overall structure of the root system is well captured by comparison with the 3D structure of the ground truth shown in Figure 3B,C. It is useful to recall, while inspecting Figure 3E,F, that the classification is realized in a pixel-by-pixel two-dimensional (2D) process, and it would be possible to improve this result by considering 3D patches.
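Following the definitions above, QM can be computed directly from the confusion matrix counts; `quality_measure` is a hypothetical helper name:

```python
def quality_measure(tp, fp, fn):
    """QM = sensitivity x specificity, maximized at 1 for a perfect segmentation.

    Sensitivity = TP / (TP + FN): proportion of root pixels detected as such.
    Specificity (as defined in the text) = TP / (TP + FP): proportion of
    detected root pixels that are truly root pixels.
    """
    if tp == 0:
        return 0.0
    sensitivity = tp / (tp + fn)
    specificity = tp / (tp + fp)
    return sensitivity * specificity

# Example: all roots found (FN = 0) with no false alarms (FP = 0) gives QM = 1.
```

True negatives do not enter QM, which is why the measure remains informative even though soil pixels vastly outnumber root pixels.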
Obtaining the three-dimensional (266 × 266 × 26 pixel) result of Figure 3E,F took about 2 h. The computing time was mainly due to the computation of the features (90%) on all the pixels of the testing image, while the other steps had negligible computation costs. Here, we considered the full feature space (1000 features) from [28]. It would certainly be possible to investigate reducing the dimensions of this feature space while preserving the performance obtained in Figure 3. Instead, in this study we investigated the robustness of our segmentation algorithm when the parameters or data sets depart from the nominal conditions exhibited in this section.
Table 2. Confusion matrix of results in nominal conditions, as shown in Figure 3. The total number of pixels was 1,784,744.

Robustness
A first important parameter for our machine learning approach is the size of the learning data set. In usual studies based only on real data of finite size, the influence of the learning data set size is difficult to study, since increasing the learning data set necessitates reducing the test data set.
With the data from the previous section, where roots are simulated, we do not have to cope with this limitation, since we can generate an arbitrarily large training data set. Figure 4 illustrates the quality of the segmentation obtained for different training data set sizes on a single soil-root realization. A degradation of the results is visible when decreasing the training size from 100 to 25 patches. As the root simulator is a stochastic process, the performance is also given in Figure 5 as a function of the size of the training data set, in terms of a box plot with average performance and standard deviation computed over five realizations for each training data set size tested. As visible in Figure 5, the average performance is almost constant, and increasing the size of the training data set mainly provides benefits in terms of a decrease in the dispersion of the results.
A second parameter of importance in our machine learning approach is the size of the patch. As visible in Figure 6 (left), decreasing the size of the patch produces a finer segmentation of the roots and their surrounding tissue, but also increases spurious false detections far from the roots. Increasing the patch size (see Figure 6, right) produces a good segmentation of the roots, with very few false detections far from the roots but with over-segmentation of the tissue directly surrounding the root. An interesting approach consists in combining the results produced with a small and a large patch by a simple logical AND operation, i.e., an intersection, which detects as roots only the pixels detected as roots for both patch sizes. This was the strategy adopted in Figure 3; it removes false detections far from the root while preserving the fine detection of the root, with few false detections in the tissue surrounding the roots.
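The combination of the two patch sizes by a logical AND can be sketched as follows; the function name is hypothetical:

```python
import numpy as np

def combine_patch_scales(mask_small, mask_large):
    """Intersect the root masks predicted with a small and a large patch.

    A pixel is kept as root only if both patch sizes agree, which removes the
    far-field false positives of the small patch and the near-root
    over-segmentation of the large patch.
    """
    return np.logical_and(mask_small, mask_large)

# Example: only pixels detected at both scales survive.
small = np.array([[True, True], [False, True]])
large = np.array([[True, False], [False, True]])
combined = combine_patch_scales(small, large)
```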

Segmentation on Real Roots
In this section, we investigate the performance of our machine learning algorithm when applied to real roots. The contrast considered is higher than in the previous section (see details in Section 2) and corresponds to that found in [20]. We conducted the numerical experiment described in Figure 7, designed for the segmentation of real root systems after learning from simulated root systems and simulated soil.
We tried to limit the size of the set of parameters controlling the simulated root and soil. First, the shape of the simulated root systems was not chosen to be realistic, but the typical size of the objects was chosen to be similar to the size of the real roots to be segmented. Also, as in the nominal conditions section, we tuned the first-order statistics (mean and standard deviation) of the roots and soil given in Table 3, which corresponded to the contrast found in real acquisition conditions (Figure 7B). In addition, we tuned the second-order statistics of the simulated soil on real data sets. These second-order statistics were controlled by means of the algorithm described in Figure 8. By operating this way, we obtained the fairly well-segmented images shown in Figure 9, with the confusion matrix given in Table 4.
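The DCT low-pass step of the pipeline in Figure 8 can be sketched as follows; `lowpass_dct` is a hypothetical name, and the iterative decrease of the cutoff until the autocorrelations match is only indicated in the comments:

```python
import numpy as np
from scipy.fft import dctn, idctn

def lowpass_dct(image, keep):
    """Low-pass filter as DCT -> mask -> inverse DCT.

    `keep` is the side length of the retained low-frequency coefficient block
    and acts as the cutoff frequency: shrinking it blurs the image and widens
    its autocorrelation spike. The full pipeline starts from white noise with
    a very high cutoff and decreases `keep` until the autocorrelation of the
    simulated texture is similar enough to that of the real image.
    """
    coeffs = dctn(image, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0
    return idctn(coeffs * mask, norm="ortho")

# Example: smoothing white noise with a small cutoff.
rng = np.random.default_rng(0)
noise = rng.normal(0, 1, (32, 32))
smooth = lowpass_dct(noise, 4)
```

The similarity criterion between autocorrelation matrices is left unspecified here, as the paper does not give its exact form.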

Robustness
We have engineered the training data, generated from a simulation, to enable our machine learning algorithm to yield the good segmentation result of Figure 9. We found that the simulated training data must share some statistics with the real image for the segmentation to work. Specifically, it is sufficient to ensure that the first-order statistics (mean, standard deviation) and second-order statistics (autocorrelation) match the corresponding statistics of the image to be segmented. To test the robustness of this result, we repeated the same segmentation while changing one of these statistics in the training data, the other two staying similar to the testing data. The performance evolution can be found in Figure 10. As expected, segmentation is best when the training data resemble the testing data. However, segmentation remains good when the statistics stay within a reasonable range around the optimal value. These ranges are visible in Figure 10. For example, Figure 10A shows that a root mean in the 80-120 range provides reasonably good results compared with the optimal case of 97, which corresponds exactly to the mean of the image to be segmented. This establishes conditions under which it is possible to automatically produce efficient segmentation of soil and roots in X-ray tomography. Also, if one expects the statistics of real images to depart from those that served in the training stage, it would be possible to normalize the testing data at the scale of patches in order to keep them in the range of the expected statistics.
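The patch-scale normalization suggested above could be sketched as follows; `normalize_patch` is a hypothetical helper:

```python
import numpy as np

def normalize_patch(patch, target_mu, target_sigma):
    """Rescale a testing patch so its mean and standard deviation match the
    statistics used during training, keeping real data in the range the
    classifier expects."""
    mu, sigma = patch.mean(), patch.std()
    if sigma == 0:
        return np.full(patch.shape, float(target_mu))
    return (patch - mu) / sigma * target_sigma + target_mu

# Example: mapping an arbitrary patch to (placeholder) training statistics.
p = np.arange(16.0).reshape(4, 4)
q = normalize_patch(p, target_mu=100.0, target_sigma=5.0)
```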

Conclusions, Discussion and Perspectives
In conclusion, in this article we have demonstrated the value of deep learning coupled with a statistical synthetic model to address the difficult problem of soil-root segmentation in X-ray tomography images. This was obtained with the so-called transfer learning approach, where a convolutional neural network is trained for classification purposes on a huge image data set, distinct from the soil-root data, to select a feature space which is then used to train an SVM on the soil-root segmentation problem. We demonstrated that such an approach gives good results on simulated roots and on real roots, even when the soil-root contrast is very low. We discussed the robustness of the obtained results with respect to the size of the training data sets and the size of the patch used to classify each pixel. We illustrated the possibility of performing segmentation of real roots after training on purely synthetic soil and roots. This was obtained in stationary conditions where both soil and root could be approximated by their first-order and second-order statistics.
As a point of discussion for the bioimaging community, one could wonder about the interest of the proposed approach coupling transfer learning and simulation when compared with more classical machine learning approaches. For instance, one could consider the random forest available in the WEKA 3D FIJI plugin [33], which produces a supervised pixel-based classification. On one hand, in contrast to what we propose with transfer learning, random forest features are not automatically designed but have to be selected by the user. A default mode for the user is to select all features made available in a given implementation. However, this mode is the most expensive in terms of computation time. An optimal selection of features can save time but requires an expertise which may not be accessible to all users. On the other hand, random forest is known to be efficient when small annotated data sets are available, while transfer learning requires a comparatively larger data set due to the higher number of parameters to be optimized. Beyond this comparison, it is important to underline that the use of simulated data in supervised machine learning is not limited to transfer learning and can be applied with any kind of classifier. It would thus be possible to train a random forest with synthetic data under WEKA 3D FIJI to benefit from the automated annotation.
Several perspectives are opened by this work. This use of deep learning for the soil-root segmentation problem can serve as a reference to investigate more complex situations found in practice. The water density of roots may not be constant along the root system. The soil, because of gravity, often does not present the same compactness along the vertical axis. It would therefore be important to push forward the investigation initiated in this article in the direction of non-stationarity of the gray levels in the root and in the soil. Also, one could add a physical simulator to include typical artifacts due to X-ray propagation in a heterogeneous granular material. As another direction of modeling improvement, one could consider enriching the L-system (as in [34]) to provide further biological insight.
As another direction, the deep learning algorithm considered for this article was purposely chosen as a basic reference from 2014 [28], which of course can be considered outdated given the huge research activity on network architectures. The computation time of our algorithm was rather long here because the testing stage, where each patch is classified, was not parallelized. With a parallelization of this task, the process could easily be reduced to a few minutes. Also, it would be possible to use other neural network architectures, including the autoencoder approach [27], where the classifier produces the segmentation in a single pass. Better performance would thus certainly be accessible for the root/soil segmentation following the global approach described in this article.
Instead of focusing solely on efficiency, the goal of this article was rather to demonstrate the interest of coupling transfer learning with simulated data. As demonstrated in this article, the use of simulated data offers the possibility of generating unlimited data sets and enables the control of all parameters of the data set. It is especially useful to establish conditions under which transfer learning can be expected to give good results on real soil-root segmentation. Also, thanks again to the use of simulated data, which provides annotated ground truth, our approach could serve as a basis for comparison with classic image analysis methods for soil-root segmentation (for instance [20]) or with other deep learning-based algorithms.

Algorithm 1. Proposed machine learning algorithm for image segmentation.
1: CNN ← load(ImageNet.CNN)
2: number of training pixels ← to be fixed by the user
3: patch size ← to be fixed by the user
4: create patches around the labeled training pixels
5: training features ← CNN(training patches)
6: SVM ← train(training features, labels)
7: for each pixel of the testing image: create a patch, compute its features with the CNN, and predict its label with the SVM

Figure 1 .
Figure 1. Convolutional filters selected by the first layer of the chosen convolutional neural network (CNN).

Figure 2 .
Figure 2. Presentation of simulated root systems. Panel (A) gives a slice of the training image (including soil and roots). Roots were generated by simulating an L-system structure and replacing soil pixels with root pixels, with gray intensity levels drawn from a white Gaussian probability density function with fixed mean and standard deviation. The positions of root pixels are shown in white in panel (C), which acts as a binary ground truth. Panel (B) gives a slice of the testing image, where roots were generated the same way and have the same mean and standard deviation as in panel (A). The ground truth of panel (B) is given in panel (D).

Table 1 .
Mean (µ) and standard deviation (σ) values for the roots and soil in Figure 2. Images are coded on 8 bits. Training and testing images have the same statistics.

Statistics Root µ Root σ Soil µ Soil σ

Figure 3 .
Figure 3. Experiments on simulated roots. Panel (A) gives a slice of the binary ground truth (position of the roots in white). Panels (B,C) provide a three-dimensional (3D) view of the ground truth from two different standpoints. Panel (D) gives the result of the segmentation. Blue pixels signify true negatives (soil pixels predicted as such), yellow represents false positives (soil pixels predicted as roots), orange signifies true positives (root pixels predicted as such), and purple shows false negative pixels (none in this result). Panels (E,F) are 3D views of the segmentation from the two viewing angles of panels (B,C).

Figure 4 .
Figure 4. Segmentation results for training sizes of 25, 100, 500, 1000, and 2000 patches, drawn using the 3D training image (see Figure 2). The image to segment measured 266 × 266, with 312 root pixels. The color code is the same as in Figure 3.

Figure 5 .
Figure 5. Quality of segmentation QM for training sizes of 25, 100, 500, 1000, and 2000 patches. As the root simulator is a stochastic process, the performance is given in terms of a box plot with average performance (red line), standard deviation (solid lines of the box), and max-min (the "whiskers" of the box) computed over five realizations for each training data set size tested.

Figure 6 .
Figure 6. From left to right: segmentation results for patch sizes 5, 15, 25, and 31 pixels wide. Roots in the image to segment had a diameter of between 10 and 15 pixels. The color code is the same as in Figure 3.

Figure 7 .
Figure 7. Experiment on real roots. Panel (A) shows a slice of the training image (simulated). The corresponding binary ground truth is given in white in panel (C). Root intensity values are generated from a white Gaussian probability density function with a fixed mean and standard deviation, and are then low-pass filtered (see Figure 8) until the autocorrelation of the root is similar to that of real roots. The same goes for the soil. Panel (B) is a slice of the testing image, a real X-ray tomography reconstruction. Panel (D) shows the approximate ground truth of (B), created manually.

Figure 8 .
Figure 8. A pipeline for creating simulated texture with the same autocorrelation as real images. White noise is generated, then a low-pass filter is applied. This filter is a discrete cosine transform (DCT) -> mask -> DCT−1 transformation, where the size of the mask acts as the cutoff frequency of this filter. We start with a very high cutoff frequency (i.e., we do not change the white noise greatly). The autocorrelation matrix of the new image is then compared to that of the real image, and the cutoff frequency is decreased if they are not similar enough, making the image "blurrier" and the autocorrelation spike wider.

Table 3 .

Figure 9 .
Figure 9. Panel (A) shows the ground truth (manually estimated). Panel (B) shows the segmentation obtained with our machine learning algorithm, with a training set size of 1000 and a patch size of 12. The color code is the same as in Figure 3.

Table 4 .
Confusion matrix of the results of Figure 9. The total number of pixels was 49,248 and the quality metric QM = 0.57.

Figure 10 .
Figure 10. Panel (A), top, shows the evolution of the quality metric QM when the root intensity mean varies from 0 to 255 in the training data (standard deviation and autocorrelation staying the same as in the testing data). The vertical blue line is the value of the root mean in the testing data (which we used for our results of Figure 9). The bottom part shows parts of training images with different means as an illustration. Panels (B,C) show the same results for standard deviation and autocorrelation. Autocorrelation was quantified as the value of the cutoff frequency of the filter during the simulated image's creation.