On IoT-Friendly Skewness Monitoring for Skewness-Aware Online Edge Learning

Abstract: Machine learning techniques generally require or assume balanced datasets. Skewed data can prevent machine learning systems from ever functioning properly, no matter how carefully the parameter tuning is conducted. A common solution to the problem of high skewness is therefore to pre-process data (e.g., via log transformation) before applying machine learning to real-world problems. Nevertheless, this pre-processing strategy cannot be employed for online machine learning, especially in the context of edge computing, because it is barely possible to foresee and store the continuous data flow on IoT devices on the edge. It is thus crucial and valuable to enable skewness monitoring in real time. Unfortunately, there exists a surprising gap between practitioners' needs and scientific research in running statistics for monitoring real-time skewness, not to mention the lack of suitable remedies for skewed data at runtime. Inspired by Welford's algorithm, which is the most efficient approach to calculating running variance, this research developed efficient calculation methods for three versions of running skewness. These methods can conveniently be implemented as skewness monitoring modules that are affordable for IoT devices in different edge learning scenarios. Such IoT-friendly skewness monitoring eventually acts as a cornerstone for developing the research field of skewness-aware online edge learning. By initially validating the usefulness and significance of skewness awareness in edge learning implementations, we also argue that conjoint research efforts from relevant communities are needed to boost this promising research field.


Introduction
Skewness measures the asymmetry of a probability distribution about its mean, and it plays a crucial role in data-intensive machine learning implementations. From the perspective of exploitation, skewness can act as one of the statistical and representative features of datasets for building machine learning models [1,2]. In a special case, the skewness values of different datasets are even utilized as a boundary detection reference to facilitate clustering [3]. From the perspective of adverse effects, high skewness in training datasets can result in poor learning models that make wrong estimations in prediction tasks [4,5]. In fact, it has been identified that machine learning models can never work well with skewed data, no matter how comprehensive their parameter tuning is [6]. Therefore, it is valuable and often necessary for both practitioners and machine learning systems to be aware of data skewness, so as, for example, to mitigate the negative influence of skewed data on model training [7].
When it comes to calculating skewness, the method most widely employed in modern statistical software is the adjusted Fisher-Pearson skewness coefficient [8,9], as formulated in Equation (1).
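For reference, the adjusted Fisher-Pearson skewness coefficient is commonly written in the following standard textbook form, where x̄ is the sample mean and s the sample standard deviation; this rendering may differ cosmetically from the paper's Equation (1):

```latex
G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3
```

Equivalently, G1 = (sqrt(n(n−1))/(n−2)) · g1, where g1 = m3 / m2^{3/2} is the (unadjusted) Fisher-Pearson coefficient built from the second and third central moments.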
In summary, the contributions of this research are as follows:

• From the statistics community's perspective, this research enriches skewness monitoring solutions in the running statistics domain. It is worth highlighting that our solutions can conveniently be implemented into efficient algorithms and deployed on IoT devices for practical usage. In fact, compared with the pervasive discussions about mean, median, and variance (or standard deviation), skewness has been considered "a forgotten statistic" [8], not to mention running skewness. In particular, this research reminds the statistics community of the real-world needs of running skewness in the context of IoT and edge learning.
• From the machine learning community's perspective, this research paves the way for developing skewness-aware online learning techniques, which is particularly valuable and useful in the edge computing domain. It should be noted that although traditional skewness-aware machine learning has been studied [5,19], its generic strategy of curing high skewness (i.e., data pre-processing) cannot be applied to the online mode. In particular, this research reminds the machine learning community of the significance and the available methods of online skewness monitoring. After all, the slightest difference at the model training stage can lead to a huge error in the machine learning results.
• In addition, the shared formula derivation details (in Appendix A) can help both researchers and practitioners investigate other versions of running skewness and develop corresponding algorithms, so as to satisfy the diverse needs in practice. It should be noted that not only has our work developed an efficient cumulative version of running skewness, but we have also investigated the other two versions that can satisfy the forgetting mechanism in modern edge learning proposals. This contribution essentially follows the spirit of open methods that can act "as a higher-level strategy over open-source tools and open-access data" to facilitate scientific work [20]. In fact, this research has revealed that the obscurity in some advanced statistical methods may hinder their recognition and employment (cf. Section 3.1.2).
• Overall, our work advocates and fosters cross-community research efforts on IoT-friendly online edge learning. Given the booming of edge intelligence [21] accompanied by the resource limits of IoT equipment (e.g., the embedded RAM is only 4 KB in current taxi-mounted GPS devices), multi-domain knowledge and expertise need to be fused together to address the challenges of online edge learning. Through this paper, we particularly expect to attract more attention from the statistics community and to inspire more skewness monitoring algorithms and techniques for the various needs in the diverse edge environments.
The remainder of this paper is organized as follows. Section 2 briefly reviews the background of online machine learning in edge computing and discusses the research gap in skewness-aware online learning. Section 3 specifies the IoT-friendly equation systems of the three versions of running skewness. For ease of reading, their derivation details are separately shared in Appendix A. Section 4 focuses on the cumulative version of running skewness and reports its algorithm performance evaluation, whose results are representative of all three versions of running skewness. Section 5 initially validates the usefulness of applying skewness monitoring to online learning implementations. The conclusions are drawn in Section 6, together with our future work plans.

Related Work
The proliferation and interconnection of numerous data sources across various IoT devices are speeding up the digital explosion in our everyday lives [22]. For example, the International Data Corporation estimates that about 80 billion devices will be connected to the Internet by 2025, and as a result, the globally generated data will be as tremendous as 163 zettabytes [23]. Since it is often impractical and even impossible to centralize the explosive data due to bandwidth limits and privacy concerns, some research organizations have predicted that the majority of the data (over 90 percent) will be stored and processed locally in a distributed manner [24]. Correspondingly, along with the revolutionary development of ubiquitous mobile equipment, wearable gadgets and smart sensors, innovative data analytical solutions are increasingly being developed and deployed on the edge of the Internet.
In particular, driven by the intensive interests in the "intelligent edge" [23], there is a clear trend in implementing machine learning techniques on various and distributed IoT devices on the edge. Although the implementation of edge machine learning is still at an early stage [25], it is evident that IoT and edge devices are complementary to cloud resources for scaling machine learning systems [24]. Furthermore, considering that the data in production for edge learning are generated gradually instead of being available all at once, the trained models must be updated "in an online fashion to accommodate new data without compromising its performance on old data" [26]. Thus, online machine learning has become a distinctive research topic in the edge learning field [27], which can be characterized by the real-time streaming processing computation model [28].
In fact, the revival of online learning in edge computing can be traced back to sequential learning, which was studied decades ago [29]. Nevertheless, as identified in [15], those early studies may still require (estimating) the samples' statistical information in advance for their sequential learning implementations. On the other hand, although recent studies have tried to simplify the online learning mechanisms, the solutions tend to bypass, ignore, or tolerate the absence of statistical characteristics of the data, e.g., by pre-setting suitable skewness values for simulations [30,31], using robust-to-skewness metrics to measure performance [32], etc.
It should be clarified that the machine learning community is well aware of the importance of skewness. For example, since canonical machine learning techniques assume balanced datasets [19], not only class skewness but also skewed features will bias the classification of dataset samples [33]. More generally, it has been identified that machine learning models cannot do a good job with skewed data, no matter how well the parameter tuning is conducted [6]. Unfortunately, to the best of our knowledge, little research into online machine learning has taken the real-time awareness of skewness into account. The study [34] is the only one we have found that monitors data skewness at runtime, aiming to balance the skewness of labels to train weak classifiers more equally. However, its skewness calculation is not compatible with the de facto adjusted Fisher-Pearson skewness coefficient (cf. Equation (1)). In addition, this use case only corresponds to the cumulative version of running skewness (cf. Section 3).
In contrast, this research investigates three running skewness versions and their calculation methods, by aligning with the adjusted Fisher-Pearson skewness coefficient. The proposed calculation methods can conveniently be converted into efficient algorithms and be implemented as skewness monitoring modules on IoT devices in order to satisfy the different needs of skewness-aware online edge learning.

Three Versions of Running Skewness and Efficient Calculation Methods
Given the continuous nature of online data flows, it is improper to assume online learning to be based on a complete data population. Therefore, we always treat the processed data as samples, and correspondingly, we investigate the sample skewness instead of the population skewness of data. Furthermore, we take into account the three versions of running skewness that may be the most useful ones for implementing skewness-aware online edge learning, as follows:

• Running Cumulative Sample Skewness recursively measures the skewness of the time series data filtered from the beginning of measurement up until the current moment. This skewness version is suitable for situations in which the individual data points in a time series are all equally important (e.g., [15]).
• Running Rolling (Simple Moving) Sample Skewness recursively measures the skewness of the time series data within a fixed-size yet continuously moving sample window. This skewness version matches learning scenarios in which each datum is valid only for a period of time (e.g., [35]).
• Running Exponentially Weighted Sample Skewness also recursively and cumulatively measures the skewness of time series data. Different from the cumulative version, the data are accompanied by weighting factors that decrease exponentially as the data recede into the past. This skewness version is more practical in the process industry [36] and in the financial sector, as more recent data can better reflect the current system/market status.
To facilitate explaining and distinguishing between these skewness versions, we clarify the terminology used in this research. Firstly, we follow GNU's naming convention to emphasise the recursive feature of running statistics by highlighting the term "Running" [13], because "Moving" can also refer to the moving activity itself and might cause confusion from time to time (e.g., [16]). Secondly, since "rolling window" is widely used when describing the moving activity in the literature, we employ the term "Rolling" to characterize the second skewness version. Thirdly, we refer to "exponentially weighted sample statistics" [37] to shorten the name of the third skewness version. Overall, for both reference convenience and distinctiveness, we use Cumulative Sample Skewness (CSS), Rolling Sample Skewness (RSS), and Exponentially Weighted Sample Skewness (EWSS), respectively, to indicate the three running skewness versions in the rest of this paper.
Furthermore, to facilitate reading and verifying the equations presented in the rest of this paper, we summarise the relevant notations in Table 1.

Table 1. Notations used in the rest of this paper.

Notation — Explanation
n — Number of observations (samples) within a time series.
n_k — The most recent k items from a time series of n samples, with n ≥ k > 0.
x̄_{n_k} — Sample mean of the most recent k out of n observation values x_{n−k+1}, x_{n−k+2}, ..., x_n.
x̄_{w(n)} — Sample mean of a time series of n observation values x_1, x_2, ..., x_n with exponentially decreasing weights.
α, β — Constant coefficients between 0 and 1 for discounting the contribution of older data.
𝕊_n — Auxiliary variable that carries the crucial information of the third central moment.
𝕊_{n_k} — Auxiliary variable that is associated with, and is used to update, the rolling sample skewness S_{n_k}.
𝕍_n — Auxiliary variable that carries the crucial information of the second central moment.
𝕍_{n_k} — Auxiliary variable that is associated with, and is used to update, the rolling sample variance s²_{n_k}.
C2, C3 — Auxiliary variables (similar to 𝕍_n and 𝕊_n) that are used in the Go library.
M2, M3 — Auxiliary variables (similar to 𝕍_n and 𝕊_n) that are used in the GNU Scientific Library.

Cumulative Sample Skewness (CSS)
In this research, the efficient calculation of CSS relies on the efficient intermediate steps for calculating running mean and running variance. To make the reasoning process clear to readers, this paper also includes and explains cumulative sample mean (CSM) and cumulative sample variance (CSV).

Calculation of Cumulative Sample Mean (CSM)
Following the previous naming convention, we define CSM as implying the recursive calculation of the arithmetic average of a random sample from some population, as shown in Equation (2), where x̄_n represents the sample mean of a time series of n observation values x_1, x_2, ..., x_n, while x̄_{n+1} indicates the updated sample mean after the new value x_{n+1} joins the observation time series. Although the derivation of Equation (2) is straightforward, we still briefly report it in this paper (cf. Appendix A.1) for the purpose of completeness.
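As a minimal illustration, the recursive mean update behind Equation (2), x̄_{n+1} = x̄_n + (x_{n+1} − x̄_n)/(n+1), can be sketched in Python as follows (the function name is ours):

```python
def update_cumulative_mean(mean, n, x_new):
    """Update the cumulative sample mean (CSM) when x_new arrives.

    mean  -- sample mean of the first n observations
    n     -- number of observations seen so far (n >= 0)
    x_new -- the incoming observation x_{n+1}
    """
    return mean + (x_new - mean) / (n + 1)


# Streaming usage: only the current mean and count are stored.
mean, n = 0.0, 0
for x in [2.0, 4.0, 6.0]:
    mean = update_cumulative_mean(mean, n, x)
    n += 1
# mean is now 4.0, the average of the three observations
```

Note that no observation history needs to be retained, which is what makes the running statistic affordable on memory-constrained IoT devices.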

Calculation of Cumulative Sample Variance (CSV)
In practice, standard deviation is commonly used to measure how far a set of data spreads out from its average value. In order to make the derivation concise, this research focuses on sample variance instead of sample standard deviation, as shown in Equation System (3). Note that the sample standard deviation can conveniently be obtained by taking the square root of the sample variance.
where s_n is the sample standard deviation of the n observation values x_1, x_2, ..., x_n. The sample variance s²_n is defined as a proportional function of the auxiliary variable 𝕍_n, which can be calculated in a recursive way; 𝕍_n essentially carries the crucial information of the second central moment, i.e., ∑_{i=1}^{n} (x_i − x̄_n)². Despite the usage of different notations, Equation System (3) is the same as the algorithm proposed by B. P. Welford in 1962 [38]. However, Welford's algorithm does not seem to be widely recognised and employed, possibly because "it is not obvious that the method is correct even in exact arithmetic" [39]. As a result, the straightforward but inefficient formula of recursive variance is still frequently shared in the community [40-42]. Therefore, we have gone through the derivation of Equation System (3) and shared the details as proof in Appendix A.2.
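Welford's update can be sketched as follows. This is a standard textbook rendering rather than the paper's exact Equation System (3); the variable V plays the role of 𝕍_n, the running sum of squared deviations:

```python
import statistics


def welford_update(mean, V, n, x_new):
    """One step of Welford's algorithm for the running variance.

    mean  -- sample mean of the first n observations
    V     -- running sum of squared deviations, sum((x_i - mean)^2)
    n     -- number of observations seen so far
    x_new -- the incoming observation x_{n+1}
    Returns (mean, V) for the n + 1 observations; the sample
    variance of those n + 1 observations is then V / n.
    """
    delta = x_new - mean
    mean += delta / (n + 1)
    V += delta * (x_new - mean)  # note: uses the *updated* mean
    return mean, V


# Compare against a two-pass computation from the standard library.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, V = 0.0, 0.0
for i, x in enumerate(data):
    mean, V = welford_update(mean, V, i, x)
# V / (len(data) - 1) now equals statistics.variance(data)
```

The single-pass form avoids both storing the stream and the numerically unstable sum-of-squares shortcut.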

Calculation of Cumulative Sample Skewness (CSS)
By trying to minimize the number of multiplication and exponentiation operations from the coding perspective, we developed Equation System (4) to calculate CSS.
where S_n is the sample skewness of the n observation values x_1, x_2, ..., x_n. In this equation system, S_n is defined as a proportional function of the auxiliary variable 𝕊_n, which can be calculated in a recursive fashion; 𝕊_n essentially carries the crucial information of the third central moment, i.e., ∑_{i=1}^{n} (x_i − x̄_n)³. It should be noted that (x_{n+1} − x̄_n) and (𝕍_{n+1} − 𝕍_n) can directly be obtained from the intermediate results in Equation (2) and Equation System (3), respectively. To the best of our knowledge, this is the most efficient way to calculate CSS, as derived in Appendix A.3 and validated in Section 4.3.
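Since Equation System (4) is not reproduced in this excerpt, the following sketch uses the widely known one-pass update of the second and third central moment sums (here M2 and M3 play roles analogous to 𝕍_n and 𝕊_n), finished with the adjusted Fisher-Pearson coefficient; the operation ordering may differ from the paper's minimized form:

```python
import math


class CumulativeSkewness:
    """One-pass cumulative sample skewness (CSS) sketch."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.M2 = 0.0  # running sum of squared deviations (cf. V_n)
        self.M3 = 0.0  # running sum of cubed deviations   (cf. S_n)

    def update(self, x):
        n1 = self.n
        self.n += 1
        delta = x - self.mean
        delta_n = delta / self.n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # M3 must be updated with the *old* M2.
        self.M3 += term1 * delta_n * (self.n - 2) - 3.0 * delta_n * self.M2
        self.M2 += term1

    def skewness(self):
        """Adjusted Fisher-Pearson coefficient; requires n > 2."""
        n = self.n
        g1 = math.sqrt(n) * self.M3 / self.M2 ** 1.5
        return math.sqrt(n * (n - 1)) / (n - 2) * g1


def two_pass_skewness(data):
    """Direct (reference) adjusted Fisher-Pearson computation."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1
```

Feeding a stream through `update()` and calling `skewness()` reproduces the two-pass result to floating-point accuracy while storing only four scalars.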

Rolling Sample Skewness (RSS)
The calculation of RSS proposed in this research is based on the same derivation logic as that of CSS. Here we also include and specify the formulas of rolling sample mean (RSM) and rolling sample variance (RSV) to facilitate sharing the proof details of RSS. In particular, we introduce a special symbol n_k to represent the most recent k items from a time series of n samples.

Calculation of Rolling Sample Mean (RSM)
Given the rolling window size k for a time series of observation values x 1 , x 2 , . . . , x n , x n+1 , their RSM can be calculated via Equation (5).
where x̄_{n_k} stands for the sample mean of the most recent k observation values x_{n−k+1}, x_{n−k+2}, ..., x_n, while x̄_{(n+1)_k} indicates the updated sample mean after rolling the observation window to include the new value x_{n+1} while removing the oldest one x_o (i.e., x_{n−k+1}). The brief derivation of Equation (5) is shared in Appendix A.4.
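The rolling-mean update amounts to adding the contribution of the new value and removing that of the evicted one. A minimal sketch follows (the warm-up handling before the window fills is our own choice, not prescribed by Equation (5)):

```python
from collections import deque


class RollingMean:
    """Rolling sample mean (RSM) over a window of size k."""

    def __init__(self, k):
        self.k = k
        self.window = deque()
        self.mean = 0.0

    def update(self, x_new):
        if len(self.window) < self.k:
            # Warm-up: fall back to the cumulative mean until full.
            self.window.append(x_new)
            self.mean += (x_new - self.mean) / len(self.window)
        else:
            x_old = self.window.popleft()  # evict x_{n-k+1}
            self.window.append(x_new)
            self.mean += (x_new - x_old) / self.k  # O(1) rolling update
        return self.mean
```

Unlike the cumulative statistics, the window itself must be buffered so that the oldest value can be subtracted, which bounds the memory cost at k values.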

Calculation of Rolling Sample Variance (RSV)
Here we adapt Welford's algorithm [38] of CSV (cf. Section 3.1.2) to the rolling window scenario. It is noteworthy that although the adaptation of Welford's algorithm has been discussed in the open source community [43], there does not seem to be any consensus answer yet. The main reason could still be the limited disclosure of the derivation details of Welford's algorithm.
Following the previous notations, we let RSV be recursively calculated via Equation System (6). To the best of our knowledge, this is the most efficient approach to calculating RSV, as it requires fewer multiplication operations than the other methods [44].
where s²_{n_k} denotes the sample variance of the most recent k observation values x_{n−k+1}, x_{n−k+2}, ..., x_n; it can recursively be updated through its corresponding auxiliary variable 𝕍_{n_k} by rolling the observation window to include new values while removing the oldest ones. The derivation details of Equation System (6) are shared in Appendix A.5.
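As an illustration of a windowed variance update (a well-known O(1) identity, not necessarily the exact form of Equation System (6)): when x_old leaves and x_new enters, the sum of squared deviations changes by (x_new − x_old)(x_new + x_old − μ_old − μ_new). A sketch:

```python
from collections import deque


class RollingVariance:
    """Rolling sample variance (RSV) over a window of size k."""

    def __init__(self, k):
        self.k = k
        self.window = deque()
        self.mean = 0.0
        self.V = 0.0  # sum of squared deviations within the window

    def update(self, x_new):
        if len(self.window) < self.k:
            # Warm-up with plain Welford steps until the window fills.
            self.window.append(x_new)
            delta = x_new - self.mean
            self.mean += delta / len(self.window)
            self.V += delta * (x_new - self.mean)
        else:
            x_old = self.window.popleft()
            self.window.append(x_new)
            mean_old = self.mean
            self.mean += (x_new - x_old) / self.k
            # O(1) update of the sum of squared deviations.
            self.V += (x_new - x_old) * (x_new + x_old - mean_old - self.mean)
        n = len(self.window)
        return self.V / (n - 1) if n > 1 else 0.0
```

The identity follows from V = Σx² − kμ² and the rolling-mean update, so only two values (mean and V) are recomputed per step.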

Calculation of Rolling Sample Skewness (RSS)
Through the derivation specified in Appendix A.6, we define Equation System (7) to calculate RSS.
where S_{n_k} represents the sample skewness of the most recent k observation values x_{n−k+1}, x_{n−k+2}, ..., x_n. When the observation window keeps rolling forward, the continuous update of S_{n_k} is realized through the recursive calculations of the auxiliary variables 𝕍_{n_k} and 𝕊_{n_k}. It should be noted that compared with the other peer formula forms, the multiplication operations in Equation System (7) have been minimized. When implementing an algorithm from this equation system, the term 2x̄_{(n+1)_k} should be further expanded, so as to reuse the intermediate results.
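Since Equation System (7) is not reproduced here, we sketch an algebraically equivalent (though less numerically robust) O(1)-per-update alternative that slides running power sums and finishes with the adjusted coefficient:

```python
import math
from collections import deque


class RollingSkewness:
    """Rolling sample skewness (RSS) over a window of size k, sketched
    with sliding power sums p1..p3 rather than the auxiliary variables
    of Equation System (7)."""

    def __init__(self, k):
        self.k = k
        self.window = deque()
        self.p1 = self.p2 = self.p3 = 0.0

    def update(self, x_new):
        self.window.append(x_new)
        self.p1 += x_new
        self.p2 += x_new ** 2
        self.p3 += x_new ** 3
        if len(self.window) > self.k:
            x_old = self.window.popleft()
            self.p1 -= x_old
            self.p2 -= x_old ** 2
            self.p3 -= x_old ** 3
        n = len(self.window)
        if n < 3:
            return None  # skewness is undefined for fewer than 3 samples
        # Central moments from the power sums.
        mu = self.p1 / n
        m2 = self.p2 / n - mu ** 2
        m3 = self.p3 / n - 3.0 * mu * self.p2 / n + 2.0 * mu ** 3
        g1 = m3 / m2 ** 1.5
        return math.sqrt(n * (n - 1)) / (n - 2) * g1
```

Power sums can lose precision when the mean is large relative to the spread, which is precisely why the paper's deviation-based auxiliary variables are preferable on long-running streams; the sketch above is only meant to make the windowed bookkeeping concrete.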

Exponentially Weighted Sample Skewness (EWSS)
This research derives the calculation method of EWSS by referring to the formula form of CSS, i.e., Equation System (4). In other words, different formula forms of CSS will result in different calculation methods of EWSS. Given the efficiency in the CSS calculation (cf. Section 3.1.3), we claim that the EWSS method in this research is also efficient.
In particular, we argue for the relationship between exponentially weighted sample statistics and cumulative sample statistics by reusing the well-known derivation logic of exponentially weighted sample mean (EWSM) [36]. Thus, this paper also includes EWSM and exponentially weighted sample variance (EWSV) to facilitate explaining how we obtain the formula for calculating EWSS.

Calculation of Exponentially Weighted Sample Mean (EWSM)
As specified in Appendix A.7, Equation (8) for calculating EWSM is essentially a formula transformation from CSM.
where x̄_{w(n)} represents the sample mean of a time series of n observation values x_1, x_2, ..., x_n with exponentially decreasing weights, while x̄_{w(n+1)} indicates the updated EWSM after the new value x_{n+1} joins the observation time series. In particular, α is a constant coefficient between 0 and 1, and a bigger α discounts the contribution of older data faster in the EWSM calculation.
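Equation (8) corresponds to the familiar exponentially weighted moving average; a minimal sketch (the seeding with the first observation is our own choice):

```python
def ewsm_update(mean_w, x_new, alpha):
    """Exponentially weighted sample mean (EWSM) update.

    mean_w -- current exponentially weighted mean
    x_new  -- incoming observation
    alpha  -- smoothing coefficient in (0, 1); a bigger alpha
              discounts older data faster
    """
    return mean_w + alpha * (x_new - mean_w)


# Usage: seed with the first observation, then stream updates.
mean_w = 1.0
for x in [3.0, 5.0]:
    mean_w = ewsm_update(mean_w, x, 0.5)
# with alpha = 0.5: 1.0 -> 2.0 -> 3.5
```

The incremental form mean + α(x − mean) is algebraically identical to (1 − α)·mean + α·x but reuses the intermediate difference, in the same spirit as the paper's minimized formula forms.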

Calculation of Exponentially Weighted Sample Variance (EWSV)
Benefiting from the CSV formula (3), we define Equation (9) to calculate EWSV. Despite different notations, this equation is also available in the open source community [17].
where s²_{w(n)} denotes the EWSV of a time series of n observations; s²_{w(n+1)} represents the updated EWSV after a new observation x_{n+1} joins the time series; and β is also defined as a constant coefficient between 0 and 1 for discounting the contribution of older data in the EWSV calculation. In addition, we further constrain the relation between α and β, as shown in Equation System (10) and explained in Appendices A.8 and A.9.
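A common incremental formulation of the exponentially weighted variance can be sketched as follows. Note that this is not necessarily the exact form of Equation (9): it uses a single coefficient, whereas the paper couples a separate β to α:

```python
def ewsv_update(mean_w, var_w, x_new, alpha):
    """Exponentially weighted sample variance (EWSV) update (sketch).

    Returns the updated (mean_w, var_w). This is a standard
    incremental form; the paper's Equation (9) instead uses a
    separate coefficient beta constrained to alpha.
    """
    delta = x_new - mean_w
    mean_w += alpha * delta
    # Shrink the old spread and add the (discounted) new deviation.
    var_w = (1.0 - alpha) * (var_w + alpha * delta * delta)
    return mean_w, var_w
```

As a sanity check, a constant stream keeps the variance at zero, while any jump in the data injects a discounted squared deviation.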

Calculation of Exponentially Weighted Sample Skewness (EWSS)
Following the same derivation strategy as specified in Appendix A.9, we obtain Equation System (10) for calculating EWSS.
where α and β are constant coefficients directly reused from Equations (8) and (9), respectively; S_{w(n)} denotes the EWSS of a time series of n observations; and S_{w(n+1)} is the updated EWSS after a new observation x_{n+1} joins the time series.
Note that we intentionally deliver the current form of the EWSS formula in order to highlight the reusable intermediate results for algorithm design, e.g., (1 − β) · s 2 w(n) from Equation (9).
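As an illustrative alternative to Equation System (10), exponentially weighted skewness can also be tracked via exponentially weighted means of x, x², and x³ with a single shared coefficient. This yields the unadjusted coefficient, since the sample-size adjustment of Equation (1) has no direct analogue for weighted samples:

```python
class EWSkewness:
    """Exponentially weighted sample skewness (EWSS) sketch, tracked
    via exponentially weighted power means e1..e3 of x, x^2, x^3.
    Note: returns the *unadjusted* skewness coefficient."""

    def __init__(self, alpha):
        self.alpha = alpha  # smoothing coefficient in (0, 1)
        self.e1 = None      # EW mean of x
        self.e2 = None      # EW mean of x^2
        self.e3 = None      # EW mean of x^3

    def update(self, x):
        if self.e1 is None:
            # Seed all three power means with the first observation.
            self.e1, self.e2, self.e3 = x, x * x, x ** 3
        else:
            a = self.alpha
            self.e1 += a * (x - self.e1)
            self.e2 += a * (x * x - self.e2)
            self.e3 += a * (x ** 3 - self.e3)
        # Weighted central moments from the weighted power means.
        m2 = self.e2 - self.e1 ** 2
        m3 = self.e3 - 3.0 * self.e1 * self.e2 + 2.0 * self.e1 ** 3
        return m3 / m2 ** 1.5 if m2 > 0 else 0.0
```

Because all three power means share the same normalized weights, the usual central-moment identities (m2 = E[x²] − μ², m3 = E[x³] − 3μE[x²] + 2μ³) remain valid for the weighted case, which is what makes this compact bookkeeping legitimate.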

Algorithm Performance Evaluation
To verify the effectiveness and efficiency of our developed running skewness methods, we decided to compare them against the third-party methods reported in the open source community. However, as mentioned previously, we only found comparable counterparts for the cumulative version of running skewness, i.e., CSS. Therefore, we conducted performance evaluation and comparison for CSS only. Considering that the cumulative version plays a fundamental role in developing the rolling window version and the exponentially weighted version (i.e., RSS is based on the derivation logic of CSS and EWSS is based on the formula form of CSS), the CSS evaluation results will still be able to represent the performance of RSS and EWSS.

The CSS Calculation Method from GNU Scientific Library
The first third-party method involved in our comparison is from GNU Scientific Library (GSL) (the source code is located in the folder rstat within the zipped library gsl-latest.tar.gz at https://www.gnu.org/software/gsl/, accessed on 10 August 2021). GSL is a well-known numerical library covering a wide range of mathematical routines in C language. At the time of this research, the stable version was GSL-2.6, released on 20 August 2019. To facilitate the contrast between the GSL method and ours, we abstract GSL's source code into an equation system using our notations, as shown in (11).
where x̄_n and S_n, respectively, represent the sample mean and sample skewness of the n observation values x_1, x_2, ..., x_n; x̄_{n+1} and S_{n+1} are the updated sample mean and sample skewness after a new value x_{n+1} joins the observation dataset. (The equation of S_{n+1} is extracted from Lines 184 to 186 of rstat.c.) As for the auxiliary variables M2 and M3, GSL defines them analogously to 𝕍_n and 𝕊_n (cf. Table 1), i.e., they carry the information of the second and third central moments, respectively.

When it comes to designing algorithms from formulas, a popular and practical strategy is to save reusable intermediate results to minimize computational workloads and to speed up programs. Thus, compared to Equation System (11), the source code of GSL employs a set of temporary variables to reuse the intermediate results, as shown in Algorithm 1. It should be noted that Equation Systems (4) and (11) require n > 2 and n > 1, respectively, to avoid division by zero. To ensure an "apples-to-apples" performance comparison among the different algorithms, we intentionally set n > 2 in the if statement of Algorithm 1.

Algorithm 1 Cumulative Sample Skewness in GSL
Input: The incoming observation value x.
Global Parameters: The current amount of observations n, the sample mean mean of the current observation values, and the current values of auxiliary variables M2 and M3.
Output: Cumulative sample skewness of the n + 1 observations.
1: n ← n + 1 ▷ The new observation is counted.

The CSS Calculation Method from a Go Library
Although not specified by GSL, according to Equation System (11), the GSL method corresponds to the (unadjusted) Fisher-Pearson coefficient of skewness. To reinforce the comparison results, we further identified an online statistical library written in Go (GoL) (an open-source Go library for online statistical algorithms at https://github.com/alexander-yu/stream, accessed on 10 August 2021), which employs the adjusted Fisher-Pearson coefficient of skewness. In addition to the generic algorithms [45] cited by GoL, we also extract a skewness-specific equation system from GoL's source code, as shown in Equation System (12), in order to facilitate the formula comparison.
In addition to the same notations x, x̄, and S used in Equation System (11), we define C2 and C3 as two auxiliary variables in this equation system, by referring to Core.sums[2] and Core.sums[3], respectively, in the source code of GoL. By contrasting Equation Systems (12) and (11), the differences between the two methods become apparent.

Similarly, we follow GSL's code optimisation strategy to describe the GoL method's CSS calculation in Algorithm 2. Note that similar temporary variables are used in this algorithm. Moreover, we use C2/n and C3/n, respectively, to replace the calculations of the variables variance and moment (i.e., (n/(n+1)) · (C2/n) and (n/(n+1)) · (C3/n)) in the source code, for the convenience of algorithm comparison. This replacement does not negatively impact the representation of the original algorithm in GoL, because the simplified math operations even improve the algorithm's efficiency.

Algorithm 2 Cumulative Sample Skewness in GoL
Input: The incoming observation value x.
Global Parameters: The current amount of observations n, the sample mean mean of the current observation values, and the current values of auxiliary variables C2 (i.e., Core.sums [2]) and C3 (i.e., Core.sums [3]).
Output: Cumulative sample skewness of the n + 1 observations.
1: n ← n + 1 ▷ The new observation is counted.

Performance Comparison among the Three CSS Calculation Methods
To facilitate our experimental design and analysis for the performance comparison, we refer to DoKnowMe [46], which is a domain-knowledge-driven methodology for performance evaluation. For example, we prepared different sizes of workloads as the benchmark, and we pre-decided running time and correctness as two metrics to measure the experimental responses. Furthermore, since DoKnowMe is compatible with the discipline Design of Experiments (DOE) [47], we followed the principle of repeated measures design to conduct multiple trials and measurements at each experimental setting. For the purpose of conciseness, we only report critical components of the complete evaluation workflow in this paper.

Experimental Preparation
To help guide and replicate the experimental implementations, we also use a language-independent algorithm to describe the CSS calculation method developed in this research.
First of all, by using the two auxiliary variables 𝕍 and 𝕊 in Equation Systems (3) and (4), we further derive Equation (13) for calculating S_{n+1}.
Then, we follow GSL's naming convention to include temporary variables to design the corresponding algorithm, as shown in Algorithm 3.

Algorithm 3 Cumulative Sample Skewness in this Research
Input: The incoming observation value x.
Global Parameters: The current amount of observations n, the sample mean mean of the current observation values, and the current values of the auxiliary variables bbV (i.e., 𝕍_n) and bbS (i.e., 𝕊_n).

• The first workload type is to process data one by one. In this research, the programs screen the sample dataset and return the updated CSS value right after every single new price record is included in the processing.
• The second workload type is to process data chunk by chunk. In this research, the programs return the updated CSS value after every a new price records are included in the processing, while skipping the skewness calculation within the update intervals (i.e., supplementing n % a == 0 to the if statement in all the algorithms). Note that the intermediate variables outside the if-else block are still updated continuously when screening the sample dataset.
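The second workload type can be sketched as follows: the auxiliary statistics are maintained for every record, while the skewness coefficient itself is only finalized every a records. The moment updates below follow the standard one-pass form rather than Algorithm 3 verbatim, and the function name is ours:

```python
import math


def monitor_in_chunks(stream, a):
    """Yield the CSS value after every a-th record (requires n > 2)."""
    n = 0
    mean = M2 = M3 = 0.0
    for x in stream:
        # The auxiliary variables are updated for *every* record ...
        n1 = n
        n += 1
        delta = x - mean
        delta_n = delta / n
        term1 = delta * delta_n * n1
        mean += delta_n
        M3 += term1 * delta_n * (n - 2) - 3.0 * delta_n * M2
        M2 += term1
        # ... but the coefficient is only finalized once per chunk.
        if n > 2 and n % a == 0:
            g1 = math.sqrt(n) * M3 / M2 ** 1.5
            yield math.sqrt(n * (n - 1)) / (n - 2) * g1
```

Because the expensive square-root and power operations sit inside the chunk condition, enlarging a reduces the per-record cost toward the bare moment bookkeeping, mirroring the diminishing returns observed in the experiments.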

Experimental Results
Inspired by the principle of repeated measures design [47] and by the suggestion on the minimum number of observations [48], we ran the programs ten or more times at each experimental setting. We thus obtained the three programs' average running times with respect to the different workloads, as exemplified and illustrated in Figure 1. In general, the running time increases along with the size of the first-type workloads (from N = 10^4 to N = 10^6). As for the second-type workloads (from N = 10^6, C = 2 to N = 10^6, C = 10^6), it is not surprising that the running time keeps decreasing when skipping more and more skewness calculations. Nevertheless, skipping skewness calculations has little impact on the running time once the update interval is larger than 100 price records. The reason is the diminishing marginal returns from the reduced workloads. Given the predefined sample size N = 10^6, when the sample skewness is updated after every 100+ price records, the code within the if-else block is executed fewer than 10^4 times. Our extensive experiments have shown that on our testbed, the program execution at such a workload level is so fast that the corresponding running time is negligible.
Specifically, when dealing with the first-type workloads, the method from this research is slightly slower than the GSL method (by about 10 ms on average in our tests), while it is clearly faster than the GoL method (by almost 200 ms at the workload size N = 10^6). When dealing with the second-type workloads, this research shows a clear advantage over both the GSL method and the GoL method. Using the metric speedup to measure the performance difference, our method can be more than 10% faster than the other two methods in this case, as demonstrated in Figure 1.
It should be noted that in addition to fitting in the chunk-by-chunk scenario [15], the second workload type is also practical and meaningful when practitioners only need to monitor some samples of CSS values. In practice, the three algorithms can conveniently be modified to update CSS only when receiving request signals. As such, in the case of data-intensive skewness monitoring, we can obtain significant performance enhancement even by halving the update frequency (i.e., updating CSS after every two data records). For example, by switching the workload from N = 10^6 to N = 10^6, C = 2 (cf. Figure 1), the GSL method and the GoL method receive about 15% and 27% performance improvements, respectively, while our method sees more than a 22% performance improvement.

Correctness Verification
Driven by the discussions in [8,9], we decided to use modern statistical software to empirically validate the correctness of our method. For convenience, we employ Microsoft Excel's native function SKEW() to return different data samples' skewness as a reference, in order to verify and compare the calculation results from the GSL method, the GoL method, and our method. In particular, we choose the top N price records (ranging from the first three records to the first one million records) as data samples from the aforementioned benchmark dataset. The selected validation results are listed in Table 2.
By referring to the output from Microsoft Excel, we claim that both our method and the GoL method calculate CSS accurately. In contrast, the GSL method delivers different sample skewness values, although the difference shrinks as the sample size grows. This further confirms that, due to its use of the plain Fisher-Pearson skewness coefficient, the GSL method may not be suitable for modern applications. More importantly, considering that online learning only uses a small set of data for its initial training, the large skewness error from the GSL method may pose a serious threat to implementations of skewness-aware online edge learning. Thus, from the perspective of correctness, we suggest avoiding the GSL method when implementing skewness monitoring in production. Note that we double-checked the GSL method's calculation results by running a C program that directly utilizes the relevant library routines.
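The small-sample discrepancy can be reproduced with a short script (our illustration, not the paper's verification code): the plain coefficient g1 (no small-sample adjustment) differs from the adjusted coefficient G1 by the factor sqrt(n(n-1))/(n-2), which is large for tiny n and tends to 1 as n grows:

```python
import math

def fisher_pearson(data):
    """Plain Fisher-Pearson coefficient g1 (no small-sample adjustment)."""
    n = len(data)
    m = sum(data) / n
    m2 = sum((x - m) ** 2 for x in data) / n   # biased 2nd central moment
    m3 = sum((x - m) ** 3 for x in data) / n   # biased 3rd central moment
    return m3 / m2 ** 1.5

def adjusted_fisher_pearson(data):
    """Adjusted coefficient G1, the de facto sample skewness (e.g., Excel's SKEW())."""
    n = len(data)
    return math.sqrt(n * (n - 1)) / (n - 2) * fisher_pearson(data)

sample = [1.0, 2.0, 10.0]                 # tiny, right-skewed sample
print(fisher_pearson(sample))             # g1
print(adjusted_fisher_pearson(sample))    # G1 = g1 * sqrt(6), ~2.45x larger at n = 3
```

For n = 3 the adjustment factor is sqrt(6) ≈ 2.45, which matches the observation that the unadjusted result is most misleading for the small samples used in initial training.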

Performance Analysis
In addition to the correctness, we analyse the root causes of the performance differences among the three methods by investigating their algorithms together with the experimental results. First, we discuss why the GSL method is the fastest when dealing with one-by-one data. Since GSL employs the plain Fisher-Pearson coefficient of skewness, whereas GoL and this research employ the adjusted Fisher-Pearson coefficient (cf. Appendix A.3), both the GoL method and our method incur adjustment overhead in the CSS calculations. However, as mentioned in Section 4.3.3, de facto software packages/products generally include an adjustment for sample size [8], and thus the GSL method may not be suitable for modern applications.
Next, we discuss why our method is the fastest when coping with chunk-by-chunk data. Thanks to the unified pseudo code, it is easy to count that Algorithm 3 has fewer math operations outside the if-else block than the other two algorithms. Although the difference is as small as a single multiplication, it is magnified by large sample sizes, especially in the long-term data streaming scenario. Since there is no need to obtain skewness values within a data chunk (i.e., the code inside the if-else block is not executed in this case), the aforementioned adjustment overhead is tangibly reduced even if the chunk size is two, which eventually makes our method faster than the GSL method.
As for the GoL method, in addition to the adjustment overhead and the extra multiplication, Algorithm 2 also suffers an avoidable performance loss inside its if-else block: the math operations there can be further simplified into the same form as in our method. The current form of Algorithm 2 would be reasonable if both the variance value and the skewness value were needed. However, as shown in Algorithms 1 and 3, the explicit variance calculation is not required when monitoring skewness values only.
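The simplification can be sketched as follows (our reading of the difference, not the paper's literal pseudo code): given the running count n, the sum of squared deviations V, and the sum of cubed deviations S, the variance-first evaluation and the folded evaluation are algebraically identical, but the latter skips the separate variance step when only skewness is monitored:

```python
import math

def skew_via_variance(n, v, s):
    """Variance-first evaluation: compute the explicit variance value,
    then the standard deviation, then the skewness."""
    variance = v / (n - 1)               # explicit variance, only needed if reported
    sd = math.sqrt(variance)
    return n / ((n - 1) * (n - 2)) * s / sd ** 3

def skew_direct(n, v, s):
    """Folded evaluation: the variance term is absorbed into one expression,
    saving the separate variance/standard-deviation step."""
    return n / ((n - 1) * (n - 2)) * s / (v / (n - 1)) ** 1.5
```

Both return the same adjusted Fisher-Pearson value; the variance-first form only pays off when the variance itself must also be output.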
Overall, the performance advantage of this research is particularly meaningful and valuable in the context of online edge learning, because practitioners "strive for millisecond-level learning; everything else comes second" [49]. Moreover, considering that IoT devices generally have limited computing and memory capabilities, minimizing operations is more IoT-friendly and can bring significant marginal utility. Even in the broader context of edge cloud computing, any (even small) performance improvement matters, as "every drop of 20 ms . . . latency will result in a 7-15% decrease in page load times" [50].

Validation and Discussion about Skewness-Aware Online Edge Learning
According to the existing studies on skewness-aware machine learning, the typical strategies of dealing with skewed data include: (1) collecting a suitable number of observations to build a balanced training dataset [5,19], (2) using log transformation to normalise the distribution [19], (3) removing the skewed features when training models [33], (4) employing a penalty function to reduce the impact of skewness [7], etc. Benefiting from the running skewness methods developed in this research, we argue that practitioners will be able to adapt (at least some of) these strategies to different scenarios of online edge learning. As described in the following subsections, we conducted an experimental investigation and proposed a theoretical mechanism to justify the feasibility and usefulness of being aware of skewness in online edge learning.

Automatic Decision Making in Sample Size for Initial Training
Before making predictions at runtime, online learning needs to accumulate a certain amount of training data to obtain the first version of the model. The de facto strategy seems to be using a random or experience-based sample size for initial training. For example, the study [14] initializes online learning models using the first 100 observations by default, or using a manually input number of observations (the variable numLags at https://github.com/chickenbestlover/Online-Recurrent-Extreme-Learning-Machine/blob/master/run.py, accessed on 10 August 2021).
Inspired by the aforementioned first strategy of dealing with skewed data, we propose employing CSS to monitor the skewness of the accumulated samples against a predefined threshold, so as to decide the initial training dataset automatically. For ease of comparison, we directly reused the online learning implementations in [14] and slightly modified the source code into a skewness-aware version. Firstly, we commented out the code that standardises the whole dataset, because an online learning "model does not know how many data will be presented" [26]. Secondly, we supplemented the original source code with Algorithm 3 for real-time monitoring of skewness. In particular, since the reported datasets are generally balanced (cf. Section 5.2), we set the skewness threshold to 0.05 on the absolute skewness value.
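A minimal sketch of the proposed decision rule (ours; the function name, parameters, and the naive recomputation are our illustration, whereas the paper would use the incremental CSS of Algorithm 3): keep accumulating observations until the absolute sample skewness of the accumulated buffer falls below the threshold, then use the buffer as the initial training set:

```python
import math

def initial_training_size(stream, threshold=0.05, min_size=3, max_size=100):
    """Accumulate observations until |skewness| < threshold, then return
    (size, skewness) for the initial training set. Skewness is recomputed
    naively here for clarity; an incremental CSS monitor would be O(1) per step."""
    buf = []
    for x in stream:
        buf.append(x)
        n = len(buf)
        if n < min_size:
            continue
        m = sum(buf) / n
        v = sum((xi - m) ** 2 for xi in buf)
        s3 = sum((xi - m) ** 3 for xi in buf)
        if v == 0.0:
            continue                      # constant data: skewness undefined
        sd = math.sqrt(v / (n - 1))
        skew = n / ((n - 1) * (n - 2)) * s3 / sd ** 3
        if abs(skew) < threshold or n >= max_size:
            return n, skew
    return len(buf), None                 # stream ended before the threshold was met
```

With a balanced prefix of the stream, this rule can stop well before the default 100 observations, mirroring the 26-observation result reported below for the taxi dataset.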
Taking the online-sequential extreme learning machine (OS_ELM) over the New York City taxi passenger dataset (i.e., nyc_taxi.csv) as an example, the time-series prediction results from executing the original program and our skewness-aware version are both plotted in Figure 2. It should be noted that we kept the default neural network setting (i.e., 100 input neurons, 23 hidden neurons, and 1 output neuron) in this test. In general, every single neuron in a neural network receives a group of weighted inputs and returns an output after applying an activation function. Given the typical three-layer architecture of the default setting, the input-layer neurons represent individual attributes of the dataset to be learned, the hidden-layer neurons are equipped with nonlinear activation functions to be able to solve non-linear problems, and the output-layer neurons eventually deliver an output that represents the neural network's prediction. Compared with the original implementation that uses 100 observations (with skewness value −0.5589) in the initial training, our skewness-aware program automatically selects the first 26 observations (with skewness value −0.0157) to train the initial model. According to the plot in Figure 2, intuitively, using fewer training data does not seem to have a negative impact on the prediction results. On the contrary, the predicted values from the skewness-aware version even better fit the observed values in many cases, e.g., the time series points from 12,150 to 12,160 and from 12,200 to 12,210.
To compare our skewness-aware model with the original model objectively, we assessed their prediction accuracy quantitatively. Specifically, we follow the study [14] in employing the normalised root-mean-square error (NRMSE) to measure the difference between the observed values and the predicted values. Without standardising the whole dataset, the NRMSE of the original OS_ELM is about 70.06%, while the skewness-aware OS_ELM achieves clearly better forecast accuracy with a lower NRMSE of 53.88%.
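For reference, a range-normalised NRMSE can be computed as below; this is one common variant (our assumption: [14] may normalise by a different quantity, e.g., the mean or the standard deviation of the observations):

```python
import math

def nrmse(observed, predicted):
    """Root-mean-square error normalised by the range of the observed values."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / (max(observed) - min(observed))
```

A perfect forecast yields 0, and larger values indicate larger relative error, so the drop from about 0.70 to about 0.54 reported above corresponds to a tangible accuracy gain.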
We further tested different neural network sizes by gradually increasing the number of hidden neurons. The skewness-aware OS_ELM seems to always have an advantage over the original OS_ELM implementation, as listed in Table 3, although the advantage may become trivial with large neural networks. This finding reveals that, given a better-trained initial model, the subsequent online learning can achieve higher prediction accuracy, which also matches the common saying that "the slightest difference leads to a huge error".

A Redundancy Mechanism for Skewness-Aware Online Edge Learning
When it comes to skewness awareness after the initial training stage, inspired by the use of log transformation to deal with high skewness, we propose maintaining two online learning models at runtime, one over the observed data and one over the transformed data. As such, given a predefined threshold, the online learning system can keep monitoring skewness and accordingly switch between the two models to output predicted values, as illustrated in Figure 3. As for the threshold, following the suggestions from the literature [51], we distinguish between balanced data and skewed data by checking whether or not the absolute skewness value is less than three. Such a redundancy mechanism may be particularly suitable for the partially skewed scenario where the online learning logic involves a time/observation window or a forgetting mechanism; correspondingly, RSS or EWSS can be employed to monitor skewness in this case. It should be noted that this redundancy inevitably requires double computing resources for a single online learning task. A cost-benefit analysis should therefore be a prerequisite for applying the redundancy mechanism in real-world edge environments. In addition, the log transformation can be replaced with other suitable techniques for curing skewed data when implementing this redundancy mechanism.
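The switching logic can be sketched as follows (our illustration; the class, the `learn`/`predict`/`update`/`skewness` method names, and the use of log1p/expm1 are all our assumptions, standing in for any online learner and any running-skewness monitor such as RSS or EWSS):

```python
import math

class SkewnessAwareEnsemble:
    """Redundancy mechanism sketch: one model over the raw data, one over
    log-transformed data, switching on a skewness threshold of 3."""
    def __init__(self, model_raw, model_log, monitor, threshold=3.0):
        self.model_raw = model_raw      # online learner over observed values
        self.model_log = model_log      # online learner over log-transformed values
        self.monitor = monitor          # running-skewness monitor (e.g., RSS/EWSS)
        self.threshold = threshold

    def observe(self, x, y):
        self.monitor.update(y)
        self.model_raw.learn(x, y)
        # log1p requires y > -1; other transformations can be substituted.
        self.model_log.learn(x, math.log1p(y))

    def predict(self, x):
        if abs(self.monitor.skewness()) < self.threshold:
            return self.model_raw.predict(x)          # balanced: raw model
        return math.expm1(self.model_log.predict(x))  # skewed: back-transform
```

Both models are trained on every observation, so the switch at prediction time is instantaneous; the doubled training cost is exactly the redundancy overhead discussed above.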
At the time of writing this paper, we are still exploring suitable (partially skewed) time series data to validate the proposed redundancy mechanism. As discussed in Section 2, it seems that few skewed time series are employed and shared in the community. The lack of reusable skewed data echoes the well-known phenomenon of publication bias: existing research may (have to) have intentionally selected balanced datasets to study and publish, because researchers can barely obtain positive results from online learning techniques over skewed data [6].
Considering the ever-evolving nature of the discipline, we hope to use this proposed redundancy mechanism to inspire more collaboration in skewness-aware online edge learning.

Conclusions and Future Work
It is known that machine learning can never do a good job with skewed data. Thus, pre-processing whole datasets to cure high skewness (e.g., via log transformation) has been a common practice before applying machine learning technologies. In edge computing, however, such data pre-processing is infeasible, because it is impractical to store all the past observations and impossible to foresee all the future ones. For the same reason, machine learning at the Internet edge is generally implemented in an online fashion. Correspondingly, it is valuable to monitor the running skewness of continuous observations before taking suitable actions to cure high skewness at runtime.
Nevertheless, when trying to integrate skewness monitoring into online edge learning, we identified a surprising gap between practitioners' needs and the scientific research in running skewness algorithms. This research bridges the gap by developing a set of IoT-friendly statistical methods to facilitate monitoring skewness at runtime. These methods are essentially based on Welford's algorithm, which is the most efficient way to calculate running variance. We believe that, as an innovation in computer science, Welford's algorithm might have been overlooked in the statistics community.
More importantly, this research has initially validated the usefulness and significance of being aware of skewness in the implementations of online edge learning. To help boost this promising research field, our future work will unfold along two directions. Firstly, we plan to keep exploring and employing more datasets to strengthen the current validation results. Secondly, we will try to collaborate with more online learning researchers on developing real-time skewness remedies.

Appendix A.1. Proof of Equation (2) for Calculating the Running Mean

Proof. When a new observation value $x_{n+1}$ arrives, the sum of the first $n$ values can be substituted by $n\bar{x}_n$ for calculating the arithmetic average of the $n+1$ values, i.e.,

$$\bar{x}_{n+1} = \frac{n\bar{x}_n + x_{n+1}}{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1}.$$

It is clear that $\bar{x}_0$ is zero by default when no value is involved in the average calculation. This completes the proof of Equation (2).

Appendix A.2. Proof of Equation System (3) for Calculating the Running Variance

Proof. Given a series of $n$ discrete random values $x_1, x_2, \ldots, x_n$ as the sample dataset, the unbiased estimator for the population variance is:

$$s_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2.$$

By introducing the auxiliary notation $V_n = \sum_{i=1}^{n}(x_i - \bar{x}_n)^2$, this can be rewritten as:

$$s_n^2 = \frac{V_n}{n-1}.$$

When a new value $x_{n+1}$ arrives, a straightforward modification of the unbiased estimator can be:

$$s_{n+1}^2 = \frac{1}{n}\sum_{i=1}^{n+1}(x_i - \bar{x}_{n+1})^2 = \frac{V_{n+1}}{n}.$$

Then, we calculate the difference between $V_{n+1}$ and $V_n$ via Equation (A6).
Recall that $\sum_{i=1}^{n} x_i = n\bar{x}_n$; we have:

$$V_{n+1} - V_n = (x_{n+1} - \bar{x}_n)(x_{n+1} - \bar{x}_{n+1}),$$

i.e.,

$$V_{n+1} = V_n + (x_{n+1} - \bar{x}_n)(x_{n+1} - \bar{x}_{n+1}).$$

The sample standard deviation can conveniently be obtained by taking the square root of the unbiased estimator $s_{n+1}^2$. In addition, $\bar{x}_0$ and $s_0^2$ are initialised as zero by default when no value is involved in the calculation. This completes the proof of Equation System (3).
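The Welford recurrence $V_{n+1} = V_n + (x_{n+1}-\bar{x}_n)(x_{n+1}-\bar{x}_{n+1})$ can also be checked numerically with a short script (ours, for verification only), comparing the incrementally maintained $V$ against the direct sum of squared deviations at every step:

```python
def welford_check(data):
    """Incrementally maintain the mean and V, asserting at each step that V
    equals the direct sum of squared deviations over the prefix seen so far."""
    mean, v, n = 0.0, 0.0, 0
    for x in data:
        new_mean = mean + (x - mean) / (n + 1)      # Equation (2)
        v += (x - mean) * (x - new_mean)            # Welford recurrence
        mean, n = new_mean, n + 1
        direct = sum((xi - mean) ** 2 for xi in data[:n])
        assert abs(v - direct) < 1e-9
    return v / (n - 1) if n > 1 else 0.0            # unbiased variance s_n^2
```

The returned value is the unbiased sample variance $s_n^2 = V_n/(n-1)$, obtained in a single pass with O(1) memory.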

Appendix A.3. Proof of Equation System (4) for Calculating Cumulative Sample Skewness (CSS)
Proof. Given a series of $n$ observation values $x_1, x_2, \ldots, x_n$, the symmetry of their distribution is measured by the sample skewness $S_n$ (the adjusted Fisher-Pearson coefficient):

$$S_n = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}_n}{s_n}\right)^3.$$

By introducing the auxiliary notation $\mathcal{S}_n = \sum_{i=1}^{n}(x_i - \bar{x}_n)^3$, this can be rewritten into:

$$S_n = \frac{n}{(n-1)(n-2)} \cdot \frac{\mathcal{S}_n}{s_n^3}.$$

When the dataset includes a new value $x_{n+1}$, we have:

$$\mathcal{S}_{n+1} = \mathcal{S}_n + \frac{n(n-1)}{(n+1)^2}(x_{n+1} - \bar{x}_n)^3 - \frac{3(x_{n+1} - \bar{x}_n)}{n+1}V_n.$$

Thus,

$$S_{n+1} = \frac{n+1}{n(n-1)} \cdot \frac{\mathcal{S}_{n+1}}{s_{n+1}^3}.$$

In particular, $\bar{x}_0$, $s_0$, and $S_0$ are all initialized as zero by default when no observation is involved in the calculation. This completes the proof of Equation System (4).

Appendix A.4. Proof of Equation (5) for Calculating the Rolling Mean

Proof. Given $n$ observations, the arithmetic average $\bar{x}_n^k$ of the most recent $k$ values is:

$$\bar{x}_n^k = \frac{1}{k}\sum_{i=n-k+1}^{n} x_i.$$

After observing a new value $x_{n+1}$, the sample mean within the same size of rolling window can be updated in a recursive manner:

$$\bar{x}_{n+1}^k = \bar{x}_n^k + \frac{x_{n+1} - x_{n-k+1}}{k}.$$
In practice, the sample mean $\bar{x}_k^k$ of the first $k$ values should be initialised before rolling the observation window, and thus the total number of observations $n$ must be at least equal to $k$, while the rolling window size $k$ must not be zero. This completes the proof of Equation (5).
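The recursive update above can be sketched with a fixed-size window (our illustration); each step costs O(1) regardless of the window size:

```python
from collections import deque

def rolling_means(data, k):
    """Rolling mean over a window of size k, updated recursively:
    mean_new = mean_old + (x_in - x_out) / k."""
    window = deque(data[:k])            # initialise with the first k values
    mean = sum(window) / k
    means = [mean]
    for x in data[k:]:
        x_out = window.popleft()        # outgoing value x_{n-k+1}
        window.append(x)                # incoming value x_{n+1}
        mean += (x - x_out) / k         # recursive update, O(1) per step
        means.append(mean)
    return means
```

For long-running streams, periodically recomputing the mean from the stored window limits floating-point drift.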
Appendix A.5. Proof of Equation System (6) for Calculating the Rolling Variance

Proof. Given $n$ observations, the unbiased sample variance of the most recent $k$ values is:

$$\left(s_n^k\right)^2 = \frac{1}{k-1}\sum_{i=n-k+1}^{n}\left(x_i - \bar{x}_n^k\right)^2.$$

By using the notation $V_n^k = \sum_{i=n-k+1}^{n}\left(x_i - \bar{x}_n^k\right)^2$ to simplify the representation, we rewrite this equation into:

$$\left(s_n^k\right)^2 = \frac{V_n^k}{k-1}.$$

After observing a new value $x_{n+1}$, the $k$-item sample variance should be updated as:

$$\left(s_{n+1}^k\right)^2 = \frac{V_{n+1}^k}{k-1}.$$

Following the same strategy as in Appendix A.2, we calculate the difference between $V_{n+1}^k$ and $V_n^k$ via Equation (A20).
In particular,

$$V_{n+1}^k = V_n^k + \left(x_{n+1} - x_{n-k+1}\right)\left(x_{n+1} - \bar{x}_{n+1}^k + x_{n-k+1} - \bar{x}_n^k\right).$$

Similarly, the sample variance $\left(s_k^k\right)^2$ of the first $k$ values should be initialised before rolling the observation window, and thus the total number of observations $n$ must be at least equal to $k$, while the rolling window size $k$ must be bigger than 1 to avoid the error of division by zero. This completes the proof of Equation System (6).
Appendix A.6. Proof of Equation System (7) for Calculating Rolling Sample Skewness (RSS)

Proof. Given $n$ observations, the sample skewness $S_n^k$ of the most recent $k$ values is:

$$S_n^k = \frac{k}{(k-1)(k-2)}\sum_{i=n-k+1}^{n}\left(\frac{x_i - \bar{x}_n^k}{s_n^k}\right)^3.$$

By using the notation $\mathcal{S}_n^k = \sum_{i=n-k+1}^{n}\left(x_i - \bar{x}_n^k\right)^3$ to simplify the representation, we rewrite Equation (A23) into:

$$S_n^k = \frac{k}{(k-1)(k-2)} \cdot \frac{\mathcal{S}_n^k}{\left(s_n^k\right)^3}.$$

After observing a new value $x_{n+1}$, Equation (A24) can be updated as:

$$S_{n+1}^k = \frac{k}{(k-1)(k-2)} \cdot \frac{\mathcal{S}_{n+1}^k}{\left(s_{n+1}^k\right)^3}.$$

Following the same strategy as in Appendix A.3, $\mathcal{S}_{n+1}^k$ can be derived from $\mathcal{S}_n^k$ by accounting for the incoming value $x_{n+1}$ and the outgoing value $x_{n-k+1}$. To initialise the sample skewness $S_k^k$ of the first $k$ values before rolling the observation window, the total number of observations $n$ must be at least equal to $k$, while the rolling window size $k$ must be bigger than two to avoid the error of division by zero. This completes the proof of Equation System (7).

Appendix A.7. Proof of Equation System (10) for Calculating Exponentially Weighted Sample Skewness (EWSS)

Proof. By reusing the predefined constant coefficients, i.e., replacing $\frac{1}{n+1}$ and $\frac{1}{n}$ with $\alpha$ and $\beta$ respectively, we obtain the exponentially weighted sample skewness $S_{w(n)}$, which differs from $S_n$ in that it involves the past weighting effects. As regulated previously, the constant coefficients $\alpha$ and $\beta$ are two decimal numbers between 0 and 1. Since both of them are used in the EWSS formula, for the purpose of consistency, we further regulate their relationship through the replaced items as:

$$\alpha = \frac{\beta}{1+\beta}.$$

This completes the proof of Equation System (10).