Segmentation of PMSE Data Using Random Forests

Abstract: EISCAT VHF radar data are used for observing, monitoring, and understanding Earth's upper atmosphere. This paper presents an approach to segment Polar Mesospheric Summer Echoes (PMSE) in datasets obtained from EISCAT VHF radar. The data consist of 30 observation days, corresponding to 56,250 data samples. We manually labeled the data into three categories: PMSE, Ionospheric background, and Background noise. For segmentation, we employed random forests on a set of simple features, including the altitude derivative, time derivative, mean, median, standard deviation, minimum, and maximum values for neighborhood sizes ranging from 3 × 3 to 11 × 11 pixels. Next, to reduce the model bias and variance, we employed a method that decreases the weight applied to pixel labels with large uncertainty. Our results indicate, first, that it is possible to segment PMSE from the data using random forests and, second, that the weighted-down labels technique improves the performance of the random forests method.


Introduction
Polar Mesospheric Summer Echoes (PMSE) are radar echoes that form at about 75 to 95 km altitude at polar latitudes during the summer months. A long-term study of observations at 53.5 MHz made over two decades at 69° N (among many others) showed that they appear between mid-May and the end of August, and are most likely to appear in June and July, with an average occurrence of 95 percent [1].
Formation of PMSE requires the presence of turbulence, free electrons, and charged aerosols. The charged aerosols contain water ice, and their formation requires very low temperatures, sufficient water vapor [1][2][3], and nucleation centers to facilitate heterogeneous condensation. Meteor Smoke Particles (MSP), which result from meteor ablation and recondensation, have been identified as the likely condensation nuclei. Together with the MSPs, the water vapor and the cold temperatures at the mesopause at mid and high latitudes during the summer months allow the ice particles to form [4]. The combination of neutral air turbulence and the effect of negatively charged ice particles results in irregularities in the electron density distribution, which generate the observed radar echoes; see, e.g., [1].
PMSE and Noctilucent Clouds (NLC) are observed during a similar time of the year and at similar heights, and observations have shown that NLC tend to appear at the bottom of PMSE [2]. PMSE and NLC have the potential to reveal details about the atmosphere, including many changes during recent decades. An increase of NLC occurrence over the years was already noticed in observations from 1964 to 1988 [5], and one could argue that climate change may have reached the edge of space. To better understand this, systematic studies of PMSE over time can be helpful, because they reveal the existence of water ice particles at the heights where they are observed.
We aim to develop a method to investigate the thickness of PMSE, their shape, and the variation of PMSE height with time over the years. This requires the classification of PMSE in large amounts of radar data.

Random Forests
A way to characterize and segment data is to use a decision tree. A decision tree can be represented as a directed graph G = (V, E), E ⊆ V × V, where V is a finite set of nodes split into three disjoint sets V = D ∪ C ∪ T, with D the set of decision nodes, C the set of chance nodes, and T the set of terminal (end) nodes [11]. The different nodes represent different phases of a decision problem sequence [11]. In a decision node, based on observations about an item, we select an action. In Figure 1, there are two edges (d1, c1) and (d1, d2) originating from decision node d1, and one of these edges leads to another decision node d2. A chance node represents the probability of an outcome; here an outgoing edge is selected randomly. In Figure 1, there are two edges (c1, t1) and (c1, t2) originating from chance node c1 and two edges (c2, t3) and (c2, t4) originating from chance node c2. Terminal (end) nodes (t1, t2, t3, and t4) represent the outcome of a sequence of actions: for instance, an item's target value (regression) or its category (classification).
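As a concrete illustration of leaves acting as terminal nodes, the following Python sketch (scikit-learn, not the paper's MATLAB pipeline) trains a small classification tree; the feature values, samples, and class names below are made up for illustration only.

```python
# Illustrative sketch only (the paper's pipeline is in MATLAB): a small
# scikit-learn decision tree whose leaves -- terminal nodes -- store class
# labels. The feature values and samples below are made up.
from sklearn.tree import DecisionTreeClassifier

# Two hypothetical features per sample (e.g., amplitude and altitude in km).
X = [[0.90, 85], [0.80, 86],   # PMSE-like samples
     [0.20, 91], [0.10, 92],   # ionosphere-like samples
     [0.05, 78], [0.04, 77]]   # noise-like samples
y = ["PMSE", "PMSE", "Ionosphere", "Ionosphere", "Noise", "Noise"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(tree.predict([[0.85, 84]])[0])  # the leaf reached determines the class
```

Following a path from the root to a leaf corresponds to the sequence of decisions described above.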
Several decision tree algorithms exist in the literature. Algorithms such as ID3 [12], C4.5 [13], and C5.0 [14] employ information gain (which uses the concept of entropy) to decide which feature to split on at each step when building the tree, whereas other decision tree algorithms, e.g., CART [15], use Gini impurity as the splitting criterion.
Decision tree-based methods are easy to understand and interpret; however, they are not robust. For instance, a small change in the data or noise in the features can lead to a large change in the tree and its outputs [16]. This implies that decision trees might not generalize well to unseen data.
Random forests is a decision tree-based ensemble learning method with several advantages, such as a built-in estimate of the generalization error, dependence on only one or two tuning parameters, and a measure of the importance of the different data features [17]. Random forests use Breiman's bootstrap aggregation (bagging) technique, in which several individual decision trees are trained on different subsets of the training dataset obtained by random sampling with replacement [17]. Furthermore, random forests use random subsets of the available features for building the individual trees, also known as feature bagging. Probst et al. [18] suggest that the number of features randomly selected at each split (mtry) usually has a default value of √p for classification tasks, where p is the total number of features. However, mtry can be increased from its default value to improve the probability that at least one of the randomly selected features is a strong predictor [18,19].

For regression, the generalization error is measured using the out-of-bag mean square error,

$$MSE_{oob} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{f}_{oob}(x_i) \right)^2,$$

where $\hat{f}_{oob}(x_i)$ is the out-of-bag prediction for bootstrap sample $i$, $y_i$ is its actual outcome [17], and $N$ is the number of samples. For classification, the generalization error is measured using the out-of-bag error rate,

$$E_{oob} = \frac{1}{N} \sum_{i=1}^{N} I\left( y_i \neq \hat{f}_{oob}(x_i) \right),$$

i.e., 0 is added to the sum for a correct classification and 1 for an incorrect classification [17]. For example, if a PMSE sample is misclassified as a Noise sample, 1 is added to the sum in the out-of-bag error rate; if the PMSE sample is correctly classified as PMSE, 0 is added. When we apply these equations to our data, $x_i$ corresponds to sample $i$ of our dataset, $\hat{f}_{oob}(x_i)$ is the label predicted by the model for sample $x_i$ when it is held out-of-bag, and $y_i$ is the actual label of this sample.
Finally, $I(y_i \neq \hat{f}_{oob}(x_i))$ is the indicator function of a misclassification: it equals 1 when the predicted label differs from the actual one and 0 otherwise, defining the 0-1 loss whose expected value we seek to minimize. Bootstrapping ensures that the individual decision trees are distinct, which reduces the overall variance of the random forests method [20]. The final prediction is obtained by aggregating the outputs of the individual trees in the case of regression, or by taking the majority vote in the case of classification.
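The bagging and out-of-bag ideas above can be sketched in a few lines of Python with scikit-learn; the synthetic data and decision rule are assumptions for illustration, not the paper's radar data.

```python
# Minimal sketch of bagging with an out-of-bag (OOB) error estimate in
# scikit-learn; the synthetic data below stands in for labeled radar pixels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))                  # 9 hypothetical features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # a simple ground-truth rule

forest = RandomForestClassifier(
    n_estimators=200,     # many bootstrapped trees
    max_features="sqrt",  # mtry ~ sqrt(p), the common classification default
    oob_score=True,       # evaluate each sample on trees that never saw it
    random_state=0,
).fit(X, y)

oob_error = 1.0 - forest.oob_score_  # out-of-bag error rate
print(round(oob_error, 3))
```

Because each sample is scored only by trees whose bootstrap sample excluded it, the OOB error is a built-in generalization estimate that needs no separate validation set.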
The importance of features is calculated using the permutation importance method proposed by Breiman [21], as follows. First, we use the out-of-bag samples to estimate the predictions from each tree with a selected feature f [17]. Second, feature f is randomly permuted in the out-of-bag samples and the predictions are recalculated. Third, we calculate the difference between the prediction scores for the permuted and the original samples. Fourth, the average of these differences over all trees in the random forest is an estimate of the importance of feature f.
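The permutation step can be sketched as follows; note that scikit-learn's helper permutes features on a supplied dataset, whereas Breiman's original method permutes within each tree's out-of-bag samples, so this is an approximation with made-up data.

```python
# Sketch of permutation importance: shuffle one feature and measure how much
# the score drops. scikit-learn permutes on a supplied dataset, whereas
# Breiman's original method permutes within each tree's OOB samples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)       # only feature 0 carries information

forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)
result = permutation_importance(forest, X, y, n_repeats=10, random_state=1)
print(result.importances_mean)      # feature 0 should dominate
```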
Random forests methods are fast, simple, and easy to interpret via permutation importance [22]. They have been used in several applications such as: pattern recognition [23], object detection [24], remote sensing [25], and image segmentation [26].

Weighted-Down Technique
In their paper [6], Almeida et al. propose a method that can reduce both model bias and model variance. In their method, the pixel-wise label uncertainty of the training data is estimated first. Given a sample x_i with label y_i, its neighborhood uncertainty score (Equation (3) in [6]) is computed from: the number of classes C; the number of neighbors k considered for each sample (we employed k = 11 in our experiments); the number k_{y_i} of neighbors with the same label as x_i; the number k_j of neighbors with class label y_j; and a vector d_{x_i} of normalized distances to the k_{y_i} neighbors with the same class label as x_i. For the exact form of Equation (3) and the significance of the terms in its numerator and denominator, we refer the reader to the original paper [6]. Next, the training sample weights are adjusted such that samples with high uncertainty are weighted down and those with low uncertainty are weighted up.
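To convey the idea only, here is a deliberately simplified Python sketch: it scores a label as more uncertain when fewer of its k nearest neighbors share it. This is NOT the exact Equation (3) of Almeida et al. [6] (which also uses class counts and normalized distances); the function name and data are invented.

```python
# A simplified sketch of the neighborhood-uncertainty idea only -- NOT the
# exact Equation (3) of Almeida et al. [6]: here a label's uncertainty is
# just the fraction of its k nearest neighbors carrying a different label.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_uncertainty(X, y, k=11):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # idx[:, 0] is the point itself
    same = (y[idx[:, 1:]] == y[:, None]).mean(axis=1)
    return 1.0 - same                         # 0 where all neighbors agree

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])
scores = neighborhood_uncertainty(X, y, k=2)
print(scores)                                 # consistent labels -> zeros
```

Training sample weights could then be set, e.g., to 1 − scores, so that uncertain labels are weighted down.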

Metrics for Evaluation of Performance
In this section, we briefly discuss the different metrics used for evaluating the performance of methods used for data segmentation.

Classification Error
The classification error E is defined as the ratio of the number of misclassified samples, i.e., the sum of the false positives and false negatives, to the total number of samples. The classification error lies in the range [0, 1], where values closer to zero indicate fewer misclassifications and hence better performance.

Logarithmic Loss
Logarithmic loss L is based on the predicted class probabilities and is considered a more refined metric than the classification error [18,27]. It is defined as

$$L = \sum_{j=1}^{n} w_j \log\left(1 + e^{-m_j}\right),$$

where $n$ is the number of samples, $w_j$ is the weight for observation $j$ (the weights are normalized to sum to 1), and $m_j$ is the scalar classification score [28].
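Assuming the standard weighted logit form consistent with the description above (weights w_j normalized to sum to one, scalar scores m_j), a short worked example with invented scores:

```python
# Worked example of the weighted logit loss; the scores below are made up.
import numpy as np

def logit_loss(m, w):
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                  # normalize weights to sum to 1
    return float(np.sum(w * np.log1p(np.exp(-np.asarray(m, dtype=float)))))

m = np.array([2.0, 0.5, -1.0])       # large positive score = confident, correct
w = np.ones(3)                       # equal observation weights
print(round(logit_loss(m, w), 4))    # the negative (wrong) score dominates
```

Unlike the 0-1 classification error, this loss penalizes confident wrong predictions heavily, which is why it is considered the more refined metric.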

Area Under ROC Curve (AUC)
A receiver operating characteristic (ROC) curve is a simple, visual way to summarize the performance of a classifier [29]. For a two-class prediction problem where the output is either positive or negative, an ROC curve is created by plotting the true positive rate against the false positive rate.
For the test samples, the true positive rate is defined as the ratio of the number of correct positive outcomes to the number of positive samples [29], and the false positive rate as the ratio of the number of incorrect positive outcomes to the number of negative samples [29]. Finally, the area under the ROC curve (AUC) gives a scalar value in the range [0, 1] that measures the performance of a classifier. A randomly guessing classifier gives an AUC of 0.5; hence, any realistic classifier should have an AUC value greater than 0.5 [29].
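The two metrics above can be computed in a few lines; the labels and predicted probabilities below are invented for illustration.

```python
# Toy computation of classification error and AUC for a two-class problem;
# labels and predicted probabilities are invented for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0])
p_pos = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])  # predicted P(class 1)

y_pred = (p_pos >= 0.5).astype(int)        # threshold at 0.5
error = float(np.mean(y_pred != y_true))   # classification error in [0, 1]
auc = roc_auc_score(y_true, p_pos)         # 0.5 would be random guessing
print(error, round(auc, 3))
```

Note that the error depends on the 0.5 threshold, while the AUC summarizes performance over all possible thresholds.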

Method
In this section, we discuss the dataset used for our experiments. Next, we briefly explain the labeling procedure and the weighted-down technique employed in this study. After that, we describe the settings used for the random forests and, finally, the features extracted from the data.

Dataset
The data used for the analysis come from the EISCAT VHF radar located near Tromsø, Norway. The images contain the measured backscattered power as a function of altitude and time. We use a height range of 75 to 95 km for our analysis, and observations typically last several hours. The height resolution is 0.30 to 0.45 km, and the time resolution is approximately one minute. We downloaded the data, written in ASCII format, from the Madrigal website. The dates and times of the 30 observation days used in this study are listed in Table 1.

Labeling
We labeled the data manually, pixel by pixel, using the built-in MATLAB Image Labeler app. In this way, each labeled pixel belongs to one of three classes of interest, namely PMSE, Ionospheric background, and Background noise. The regions of interest considered in this paper are discussed in more detail in our previous work [30]. The PMSE class corresponds to a region where coherent scattering occurs, whereas the Ionosphere is a region where incoherent scattering occurs. The Noise region also displays incoherent scatter, but because the signal is low, the region has many missing values (NaNs), which makes it look different from the Ionosphere. The labeling was performed by visually recognizing a PMSE pattern, based on the fact that the amplitude of the PMSE is greater than its surroundings and that it has a particular wavy structure that distinguishes it from the background.

Figure 2 shows an example of the manual labeling process for a given image. We represented the original image as a heatmap, where blue pixels represent the minimum values, red pixels the maximum values, and the other colors the values in between; we use this same color code later in the paper. The values correspond to the equivalent electron density from the standard GUISDAP analysis [31]. In the labels part of Figure 2, the cyan pixels represent the Background noise, the yellow pixels the Ionospheric background, and the dark red pixels the PMSE; the dark blue pixels represent unlabeled data.

We partially labeled 18 of the 30 images, which together contain a total of 56,250 labeled samples (pixels). We used 60 percent of the labeled data for training (33,750 samples) and 40 percent for quantitative testing (22,500 samples). For qualitative testing, we used all the images.
In addition, qualitative analysis was performed by visual inspection of the segmented images by a domain expert.

Labels with Reduced Weighting
Next, we use the weighted-down technique described in Section 2.2, which aims to reduce both model bias and variance by reducing the weighting of pixel labels with large uncertainty and increasing it for labels with small uncertainty. As a first step on the manually labeled data, shown for instance in Figure 3a, we apply edge erosion to obtain a set of pixel labels that should be given lesser weight. As the labels in our experiment do not overlap, the edge erosion step generates a set of pixel labels along the label boundaries, as shown in Figure 3b. Finally, we calculate uncertainty scores based on Equation (3) in Section 2.2 for these pixel labels, and any pixel labels with nonzero uncertainty scores are excluded from further analysis. Figure 3c shows the pixel labels that remain after this removal, and Figure 3d shows the pixel labels that are removed.

Figure 3. (a) shows the manually labeled pixels, (b) shows the contours of the labels, (c) shows the image after removing labeled pixels using the weighted-down labels technique, and (d) shows the removed labeled pixels. In all four images, red marks the PMSE class, yellow the Ionospheric background, cyan the Background noise, and dark blue unlabeled data. All plots share the same axes: the horizontal axis shows time from 8:00 to 12:00 UTC, and the vertical axis shows altitude from 75 to 95 km. The chosen observation day is 30 June 2008.
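The edge-erosion step can be sketched in Python with scipy (the paper's implementation is in MATLAB); the 3 × 3 label patch below is a made-up example.

```python
# Sketch of the edge-erosion step using scipy: eroding a label mask leaves
# its interior, and the difference gives the boundary pixels that are
# candidates for down-weighting. The 3x3 patch below is hypothetical.
import numpy as np
from scipy.ndimage import binary_erosion

labels = np.zeros((7, 7), dtype=int)
labels[2:5, 2:5] = 1                 # a small made-up PMSE label patch

mask = labels == 1
interior = binary_erosion(mask)      # shrinks the region by one pixel
boundary = mask & ~interior          # pixels along the label boundary
print(int(boundary.sum()))           # 8 of the 9 patch pixels lie on the edge
```

The boundary pixels are exactly those whose uncertainty scores would then be evaluated.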

Random Forests Application
For our experiments, we employ random forests, implemented in MATLAB, for training and for evaluating the segmentation performance. The training is performed on an ensemble of 500 bagged classification trees. In line with the study by Probst et al. [18], the number of trees is kept high (500), samples are drawn with replacement, and the p-value is used as the splitting rule. In addition, we enable surrogate decision splits so that the random forests can make a decision in the case of missing data; this accommodates instances with missing amplitude values, i.e., NaNs in the data.
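A hedged Python sketch of an equivalent configuration follows. The study itself uses MATLAB with surrogate splits for NaNs; scikit-learn has no surrogate splits, so median imputation is used here purely as a simple stand-in, and the synthetic data are assumptions.

```python
# Hedged sketch of the training configuration. The study uses MATLAB with
# surrogate splits for NaNs; scikit-learn lacks surrogate splits, so here
# missing values are median-imputed as a simple stand-in.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))               # 9 features, as for one filter size
X[rng.random(X.shape) < 0.05] = np.nan      # sprinkle in missing amplitudes
y = rng.integers(0, 3, size=200)            # three classes, as in the paper

X_filled = SimpleImputer(strategy="median").fit_transform(X)
forest = RandomForestClassifier(
    n_estimators=500,   # same forest size as the study
    bootstrap=True,     # samples drawn with replacement
    random_state=0,
).fit(X_filled, y)
print(forest.n_estimators, X_filled.shape)
```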

Feature Extraction
For each pixel, we extracted a set of features that is used as input together with its label. For each pixel, we calculate the mean, standard deviation, median, minimum, and maximum values over neighborhoods of 3 × 3, 5 × 5, 7 × 7, 9 × 9, and 11 × 11 pixels centered on the pixel. In addition, we compute vertical and horizontal gradient magnitudes using Sobel kernels of sizes 3 × 3, 5 × 5, 7 × 7, and 9 × 9; see [32]. The horizontal gradient operators calculate the time derivatives, and the vertical gradient operators calculate the altitude derivatives. Furthermore, for each pixel, its altitude and amplitude are included as features. This generates a feature vector with 35 dimensions.

The different features extracted from the data are plotted in Figure 4 for the observation day 30 June 2008, except for altitude, which is not illustrated. In the figure, the image in the first row and column represents the normalized amplitude. Then, from left to right and from top to bottom, the next four images represent the vertical gradient magnitudes for filter sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9 pixels, and the following four images represent the horizontal gradient magnitudes for the same filter sizes. The next five images represent the mean values for filter sizes of 3 × 3, 5 × 5, 7 × 7, 9 × 9, and 11 × 11 pixels. In a similar way, we represent the median values, the standard deviation, the minimum values, and finally the maximum values for the same filter sizes ranging from 3 × 3 to 11 × 11 pixels.
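A Python sketch of this per-pixel feature stack using scipy.ndimage is shown below. Of the 35 features, the sketch builds 29: scipy's sobel implements only the 3 × 3 kernel, so the larger Sobel sizes (which would have to be constructed explicitly) are omitted, and a synthetic image stands in for the radar data.

```python
# Sketch of the per-pixel feature stack with scipy.ndimage. Of the paper's
# 35 features, this builds 29: five statistics at five window sizes (25),
# 3x3 Sobel gradients in both directions (2), amplitude, and altitude.
import numpy as np
from scipy.ndimage import (generic_filter, maximum_filter, median_filter,
                           minimum_filter, sobel, uniform_filter)

img = np.random.default_rng(0).normal(size=(32, 32))  # stand-in radar image

feats = [img]                                          # amplitude itself
for s in (3, 5, 7, 9, 11):
    feats.append(uniform_filter(img, size=s))          # neighborhood mean
    feats.append(median_filter(img, size=s))
    feats.append(generic_filter(img, np.std, size=s))  # standard deviation
    feats.append(minimum_filter(img, size=s))
    feats.append(maximum_filter(img, size=s))
feats.append(sobel(img, axis=0))                       # altitude derivative
feats.append(sobel(img, axis=1))                       # time derivative
feats.append(np.tile(np.arange(32.0)[:, None], (1, 32)))  # per-pixel altitude

X = np.stack(feats, axis=-1).reshape(-1, len(feats))   # one row per pixel
print(X.shape)
```

Each row of X is then a feature vector for one pixel, ready to pair with that pixel's manual label.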

Results
The performance of the random-forests-based segmentation using different combinations of features, i.e., filter sizes and numbers of randomly selected features mtry, can be seen in Tables 2 and 3. The results obtained using the original labels are outlined in Table 2, and the results obtained using weighted-down labels (discussed in Section 3.3) are shown in Table 3. We evaluate the performance in terms of the classification error for the test data, the classification error for the out-of-bag (OOB) samples, the logarithmic error for the test data, the logarithmic error for the OOB samples, and the areas under the ROC curves (AUCs) for PMSE, Ionospheric background, and Noise. Note that for each filter size in the table, we use the following features: altitude derivative, time derivative, mean, median, standard deviation, minimum, and maximum, plus altitude and amplitude. This means that for each single filter size, the feature vector has nine dimensions. For the filter size 11 × 11, the gradient filter is 9 × 9. For a single filter size, e.g., 3 × 3, we use mtry = 3, 6, 9. We then use all filter sizes and select mtry = 5, 10, 15, 20, 25, 30, and 35, where 35 is the total length of the feature vector obtained from using all filter sizes.
The results for original labels can be seen in Table 2, and the results for weighted-down labels can be found in Table 3. In both cases, the logarithmic error and logarithmic error OOB are best for a filter size of 7 × 7 pixels with mtry = 9, and worst for all filter sizes with mtry = 5. The classification error and the classification error OOB are best for the filter size 3 × 3 with mtry = 3 for original labels, and for the filter size 7 × 7 with mtry = 9 and the filter size 11 × 11 with mtry = 6 for the weighted-down labels. The worst performance, on the other hand, was obtained for all filter sizes with mtry = 5, and for the filter size 9 × 9 with mtry = 3. Almost all the AUC scores were best for the filter size 6 × 6 with mtry = 6. The only exception is the AUC Ion. Back. metric with original labels, for which the best performance was obtained for a filter size of 5 × 5 pixels with mtry = 6. For all filter sizes, i.e., the 35-dimensional feature vector with mtry = 5, 10, 15, 20, there are slight improvements in the scores for the different evaluation metrics. This possibly indicates that the random forests algorithm benefits from multi-resolution features extracted using the different filter sizes. However, the performance decreases when all filter sizes are used with mtry = 35.
Based on the results in Tables 2 and 3, we chose the filter size 7 × 7 and mtry = 9 for qualitative analysis. We applied this classification model to 30 images, of which 12 were new data for the model. Figures 5-8 show the predicted labels for this model (filter size 7 × 7 and mtry = 9). In all four cases, the prediction of PMSE labels looks poor. Figure 9 shows the predictor importance for both original (a) and weighted-down (b) labels. In both cases, altitude clearly dominates over the other features. The importance values on the vertical axis use an arbitrary scale, the results were averaged over 10 iterations, and the error bars represent one standard deviation from the average. For qualitative testing, predicted labels were generated for all 30 test images and for all the cases (all filter sizes and all mtry values) shown in Tables 2 and 3. Although the values of classification error, classification error OOB, logarithmic error, and logarithmic error OOB in Tables 2 and 3 are worst for all filter sizes with mtry = 5, these parameters gave us the best predicted labels. To illustrate our qualitative analysis, we use four examples. The predicted labels that we judged the best (all filter sizes and mtry = 5) are shown in Figures 10-13; these figures show the predicted labels for the same times and dates as Figures 5-8. Figures 14 and 15 show the corresponding predictor importance for original labels and weighted-down labels, respectively, in the case of all filter sizes with mtry = 5. The results were averaged over 10 iterations, and the error bars represent one standard deviation from the average.

Table 2. Results of the classification using original labels. The values are obtained from five iterations of each experiment; each field contains the mean over the five iterations, followed by one standard deviation.

Table 3. Results of the classification using weighted-down labels. The values are obtained from five iterations of each experiment; each field contains the mean over the five iterations, followed by one standard deviation.

Figure 5. Results of segmentation using the random forests method with the 7 × 7 filter size, weighted-down labels, and mtry = 9. The weighted-down labels technique is taken from [6]. The data are from the observation day 17 July 2009. The top image shows the original data, where the color scale represents the equivalent electron density (powers of 10, per cubic meter). The bottom image shows the predicted labels: yellow, cyan, and dark red mark, respectively, the regions labeled as Ionospheric background, Background noise, and PMSE. On both images, the horizontal axis shows time from 8:00 to 12:00 UTC, and the vertical axis shows altitude from 75 to 95 km.

Figure 6. Same as Figure 5, but for the observation day 8 July 2010 and times from 9:00 to 13:00 UTC.

Figure 7. Same as Figure 5, but for the observation day 7 July 2010 and times from 00:00 to 22:00 UTC.

Figure 8. Same as Figure 5, but for the observation day 30 June 2008 and times from 8:00 to 12:00 UTC.

Figure 9. Predictor importance for random forests used on original (a) and weighted-down (b) labels with a 7 × 7 filter size and mtry = 9. The horizontal axis lists the predictors, and the vertical axis shows their importance on an arbitrary scale; higher values mean the algorithm assigned them more importance for classifying the data efficiently. The values are averaged over 10 iterations, and the error bars represent one standard deviation from the average.

Figure 10. Results of segmentation using the random forests method with all filter sizes, mtry = 5, and the weighted-down labels technique from [6]. The data are from the observation day 17 July 2009. The top image shows the original data, where the color scale represents the equivalent electron density (powers of 10, per cubic meter). The bottom image shows the predicted labels: yellow, cyan, and dark red mark, respectively, the regions labeled as Ionospheric background, Background noise, and PMSE. On both images, the horizontal axis shows time from 7:50 to 12:00 UTC, and the vertical axis shows altitude from 75 to 95 km.

Figure 11. Same as Figure 10, but for the observation day 7 July 2010 and times from 00:00 to 22:00 UTC.

Figure 12. Same as Figure 10, but for the observation day 8 July 2010 and times from 09:00 to 13:00 UTC.

Figure 13. Same as Figure 10, but for the observation day 30 June 2008 and times from 8:00 to 12:00 UTC.
In Table 3, the evaluation scores for all combinations of filter sizes are slightly better than those in Table 2. This may imply that by employing weighted-down labels (discussed in Section 3.3), we achieve a reduction in both model bias and variance, leading to improved performance. Although the performance gain from the weighted-down labels technique of Almeida et al. [6] is marginal here, for further studies involving more label categories than the three used in this paper, the gains could be significant.
When all 35 features are used as predictors, i.e., all filter sizes combined, one can see that their importance varies with filter size. When original labels are used for the random forests algorithm, the predictor importance (Figure 14) differs from that obtained with weighted-down labels (Figure 15). The importance of the different predictors is plotted on an arbitrary scale that is linear and relative. In Figure 14, the most important feature is altitude, which is also what is used in practice to determine whether a signal is PMSE. The second most important feature is the 11 × 11 minimum value, followed by the 9 × 9 minimum value, then the 11 × 11 mean value, and so on. The ranking is similar for the weighted-down labels case (Figure 15), where the first six predictors are the same; after that, the order differs slightly. This implies that features extracted across multiple scales, ranging from 3 × 3 to 11 × 11, can play an important role in improving the predictions of the random forests. Finally, the error bars in Figure 15 (weighted-down labels) are slightly smaller than those in Figure 14 (original labels).

Figure 14. Predictor importance for random forests used on original labels with all filter sizes and mtry = 5. The horizontal axis lists the predictors, and the vertical axis shows their importance on an arbitrary scale; higher values mean the algorithm assigned them more importance for classifying the data efficiently. The values are averaged over 10 iterations, and the error bars represent one standard deviation from the average.

Figure 15. Same as Figure 14, but for weighted-down labels.

Discussion
The case with the filter size 7 × 7 and mtry = 9 gives the best quantitative results, yet its segmentation results, as shown in Figures 5-8, are poor. This could be due to poor generalization to new, unseen data. Based on the segmentation results in Figures 10-13, we note that using random forests with all filter sizes, mtry = 5, and the weighted-down labels technique of Almeida et al. [6], it is possible to segment the data into the three categories of interest. Furthermore, mtry = 5 is in line with the study by Probst et al. [18], which suggests that the recommended number of predictors is approximately the square root of the total number of features (√35 ≈ 5.9 in our case).
In one of the images used for qualitative testing, Figure 13, we notice an unusual case where part of the signal is not accurately segmented: some pixels at the border between the Ionospheric background and the Background noise were classified as PMSE. This happens around 11:00 to 12:00 UTC, although the original image above clearly shows no PMSE at that time. We think this results from the fact that PMSE usually have a horizontally elongated pattern. Because of this, the model may emphasize horizontal patterns and therefore give more importance to the vertical gradients, i.e., to the altitude derivatives. Figures 14 and 15 confirm this: the altitude derivatives are given more importance than the time derivatives. In the future, we aim to use this segmentation approach to extract the PMSE signal from the vast dataset of EISCAT observations, in order to analyze the structures of the PMSE signals in detail and to compare PMSE signals from different periods of the solar cycle.

Conclusions
This study outlines a framework to segment PMSE from the Ionospheric background and the Background noise in images obtained from EISCAT VHF radar data. We manually labeled the data into three categories, PMSE, Ionospheric background, and Background noise, yielding a dataset of 56,250 labeled samples. For segmentation, we employed random forests on a set of simple features: the altitude derivative, time derivative, mean, median, standard deviation, minimum, and maximum values for neighborhood sizes ranging from 3 × 3 to 11 × 11 pixels, with amplitude and altitude as additional features. We then used a weighted-down technique on the data labels to reduce the model bias and variance.
First, our results show that it is possible to extract the PMSE signal from the data when using all filter sizes for feature extraction and mtry = 5. Second, by employing the weighted-down labels technique, we note an improvement in the performance of the random forests.
For future studies, PMSE could be investigated over a broader dataset comprising several years of observations spanning a complete solar cycle. Information such as the thickness or shape of PMSE over the years could also be analyzed to gain further understanding of their origin and evolution.