Implementation Strategy of Convolution Neural Networks on Field Programmable Gate Arrays for Appliance Classification Using the Voltage and Current (V-I) Trajectory

Specific information about the types of appliances in use and their usage within a given time window could help to determine electrical energy consumption in detail. However, conventional main power meters do not provide such information. One of the best ways to solve this problem is non-intrusive load monitoring, which is cheaper and easier to implement than other methods. However, developing a classifier that deduces which appliances are being used at home is a difficult task, because the system should identify the appliance as fast as possible and with a high degree of certainty. To meet these requirements, a convolution neural network implemented on hardware was used to identify the appliance from its voltage and current (V-I) trajectory. For the hardware implementation, a field programmable gate array (FPGA) was used to exploit processing parallelism and improve performance. To validate the design, the publicly available Plug Load Appliance Identification Dataset (PLAID), comprising 11 different appliances, was used. The overall average F-score achieved with this classifier is 78.16% for the PLAID 1 dataset. The convolution neural network implemented on hardware has a processing time of approximately 5.7 ms and a power consumption of 1.868 W.


Introduction
Non-intrusive load monitoring (NILM) is an energy analytics discipline that relies on machine-learning and signal-processing techniques to estimate the consumption of individual appliances from global measurements taken at a limited number of locations in the grid [1]. NILM has gained prominence because individual appliance consumption can be used to promote energy-saving behavior and to power a number of applications in support of the energy-efficient home concept, which are expected to significantly reduce the carbon footprint associated with electric energy consumption [2].

V-I Shapes and NILM
V-I shapes were first applied to NILM research by Lam et al. [7]. It was shown that WS features based on the V-I trajectories provide high discrimination among different types of appliances. Later, Hassan et al. [8], extended the original set of WS features, and performed an empirical validation of such features using the REDD dataset [14]. More concretely, they benchmarked the WS features against standard features (real and reactive power consumption-PQ, and the harmonic content of the current waveforms-HAR) using four benchmark algorithms (artificial neural networks-ANN, artificial neural network coupled with an evolutionary algorithm-ANN-EA, support vector machines-SVM, and adaptive boosting-AdaBoost). Ultimately, the obtained results have shown that the selected WS outperform the traditional features in all four problems, further suggesting the high-discrimination capability of such features. Iksan et al. [15] have evaluated the potential of including two WS features (enclosed area-EA, and curvature of the mean line-CML) in a hybrid signature along with active power-P, reactive power-Q, power factor-PF, and total harmonic distortion-THD. Their approach was tested against the REDD dataset using a Naïve Bayes algorithm, and the results have shown an increase from 55% to 91% in the overall classification accuracy when EA and CML were added to the feature space.
To the best of our knowledge, the first approach to extract data-driven features from V-I shapes was the work of Gao et al. [6] using the PLAID dataset [16]. This work converted the normalized V-I shapes to binary images for training machine-learning classifiers (k-nearest neighbor (1-NN), logistic regression classifier-LGC, Gaussian naïve Bayes-GNB, SVM, decision trees-DT, and random forests-RF). In this setup, the best average accuracy was obtained for the raw binary image with 81.75%, whereas the principal components achieved an average accuracy of 77.65%. Nevertheless, the best accuracy (86.03%) was obtained when combining the raw binary image with engineered features (P, Q, the first 11 odd harmonics, and the quantized version of the current waveform). In [17] the authors expanded the work from [6] and used the raw steady-state current and voltage waveforms as training data for an ensemble of neural networks. Their approach was also tested against PLAID, and the obtained results show an average accuracy of 89%. In order to improve the discriminative power of V-I trajectories within appliance categories, Du et al. [18] proposed novel methods to create binary images from V-I trajectories and to extract learning features from such images (e.g., the number of continuums of occupied cells, and the existence or not of self-intersections). The proposed methods were extensively tested against an undisclosed dataset using a supervised self-organizing map (SSOM) learning algorithm. The obtained results showed an average accuracy of 99%. Strategies from object recognition were also attempted in [19]. Contours and elliptical Fourier descriptors were combined to extract features from the binary representation of the V-I trajectories. The proposed features were evaluated against the PLAID dataset using three supervised learning algorithms (ANN, logistic regression-LR, and RF). The results showed an average accuracy of about 80% for the RF, which is comparable to those reported in [6] but still far from the results shown in [17] for the same dataset.
Finally, De Baets et al. [9] proposed an application of CNNs to discriminate among weighted pixelated image representations of the V-I trajectories. The proposed approach was tested against two datasets, PLAID [16] and WHITED [20]. The results were reported using the F-measure for individual appliance classification, and the macro-average F-measure for the overall classification. The reported macro F-measures were 77.6% and 75.46% for PLAID and WHITED, respectively.

Convolution Neural Network Background
Because of the distinguishable signature in the V-I trajectory picture, visual deep learning has been used to classify different appliances. One of the most popular types of visual deep learning classifier for processing 2D data is the CNN [21]. A CNN extracts local features at high resolution and combines them into high-level complex features at low resolutions. To achieve this goal, CNNs have convolution and pooling layers accompanied by activation functions and, at the end, a fully connected layer with a softmax function [22]. The convolution layer produces a feature map at its output using different convolution kernels [23]. During the training phase, the CNN learns the values of the kernels for a particular task [24]. Taking into account the entire convolution layer, the feature maps can be seen as a three-dimensional (3D) map [25]. The equation for the feature map of the 3D convolution layer is given by Equation (1).
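$$C_d = \phi\left(k_d \ast f + b_d\right) \qquad (1)$$
where $1 \le d \le n_{k_d}$, $n_{k_d}$ is the number of convolution kernels in a layer, $C$ is the feature map of the entire convolution layer ($C \in \mathbb{R}^{i \times j \times n_{k_d}}$), $\ast$ is the 2D convolution operation, $k$ is the kernel, $f$ is the input matrix, $b$ is the bias and $\phi$ is the rectified linear unit (ReLU) given by Equation (2).
$$\phi(x) = \max(0, x) \qquad (2)$$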
The other layer of the CNN is the pooling layer, with nonlinear activation functions. It down-samples feature maps of the previous layer, reducing the artifacts and sharp variations on the signal that comes out from the convolution layer [26]. In the pooling layer, the max-pooling (σ pool ) is applied followed by the ReLU function [25]. The pooling layer output (P) can be expressed by Equation (3).
$$P_d = \phi\left(\sigma_{pool}(f_d)\right) \qquad (3)$$
where $f$ represents the intermediate feature maps and $d$ indexes the pooling filters in the layer. The fully-connected layer, whose basic element is a neuron, is the last layer in the CNN. A neuron is based on a simple function defined by Equation (4) [27]:
$$y = \phi\left(\sum_{i=1}^{n} w_i f_i\right) \qquad (4)$$
where $f$ is the input, $w$ the weight, $n$ the number of inputs and $\phi$ is the softmax function. The softmax function, also known as the normalized exponential function [28], can be used to represent a categorical distribution, that is, a probability distribution over $k$ different possible outcomes (Equation (5)).
$$\phi(x)_j = \frac{e^{x_j}}{\sum_{l=1}^{k} e^{x_l}}, \quad j = 1, \dots, k \qquad (5)$$
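The building blocks described in this section can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the random kernels and the zero biases are assumptions made for illustration, and the convolution is written as a cross-correlation (no kernel flip), as is common in CNN frameworks.

```python
import numpy as np

def conv2d_valid(f, k, b):
    """'Valid' 2D convolution of input f with kernel k plus bias b.

    Written as cross-correlation (no kernel flip), which is an assumption here.
    """
    kh, kw = k.shape
    oh, ow = f.shape[0] - kh + 1, f.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(f[r:r + kh, c:c + kw] * k) + b
    return out

def relu(x):
    """Rectified linear unit: max(0, x) element-wise."""
    return np.maximum(x, 0.0)

def max_pool2x2(f):
    """Non-overlapping 2 x 2 max-pooling (odd trailing rows/columns are dropped)."""
    h, w = f.shape[0] // 2 * 2, f.shape[1] // 2 * 2
    return f[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(z):
    """Normalized exponential; the max is subtracted for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
image = rng.random((50, 50))              # stand-in for a 50 x 50 V-I image
kernels = rng.standard_normal((4, 3, 3))  # four hypothetical 3 x 3 kernels
biases = np.zeros(4)

# Convolution layer followed by ReLU, stacked into a 3D feature map.
C = np.stack([relu(conv2d_valid(image, k, b)) for k, b in zip(kernels, biases)], axis=-1)
# Pooling layer: max-pooling followed by ReLU on each feature map.
P = np.stack([relu(max_pool2x2(C[..., d])) for d in range(C.shape[-1])], axis=-1)

print(C.shape, P.shape)  # (48, 48, 4) (24, 24, 4)
```

The explicit double loop in `conv2d_valid` is kept for readability; it is the multiply-accumulate pattern that a hardware implementation would parallelize.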

FPGA Implementations of NILM and CNNs
Several attempts have been made to deploy hardware and software platforms that mitigate the individual load management problem. One example uses a multi-channel data acquisition board (LabJack U6) with a processing unit (Toshiba NB300, Tokyo, Japan) [29]. However, this implementation has two limitations. First, a full mini laptop (Toshiba NB300) is connected to the sensor (LabJack U6). Second, the appliance identification is made on-line: the system has to transfer a large amount of raw data through the internet, so this kind of system cannot operate over low-bandwidth connections. Other works [30,31] aim primarily at increasing the processing speed of the underlying NILM systems using FPGAs. Remscrim et al. [30] propose the implementation of a spectral envelope processor consisting of four subsystems that: (i) acquire current and voltage; (ii) compute the spectral envelope coefficients; (iii) store the computed coefficients in an erasable memory; and (iv) transmit the coefficients via Wi-Fi. Trung et al. [31] propose to use an FPGA to implement a cumulative-sum (CUMSUM) filter for real-time noise reduction in commercial and industrial installations. Note that in none of those cases is the actual classification step performed on the FPGA itself.
In the present work, a Zynq System-on-a-Chip (SoC) was used to implement a CNN classifier for identifying appliances. This SoC has a dual-core ARM Cortex-A9 processor and an FPGA (28 nm, Artix-7 based). An important aspect of this implementation is the possibility of connecting to the data acquisition sub-system and to the internet through the processor. The main contribution of this work is the execution of a CNN for appliance identification on an FPGA. The advantage of using an FPGA is the exploitation of parallelism to speed up the process. Figure 1 depicts the proposed system using a Zynq device.
The implemented CNN is connected to the Processing Subsystem (PS) to ensure correct communication with the ARM CPU. This connection was made through a direct memory access (DMA) block in the programmable logic (PL) subsystem. The DMA is connected through an extended interface to move data into and out of the design, exploiting the Advanced eXtensible Interface (AXI) handshake protocol [32].
An efficient, dedicated computational engine is a prerequisite for implementing a CNN on an FPGA. One of the first CNN implementations on hardware dates back to the early 1990s, when an ANNA chip (a mixed analog/digital neural-network chip) was used to implement a CNN [33][34][35]. Korekado et al. [36] proposed a high-performance, low-power VLSI architecture to implement a CNN. This architecture uses a hybrid approach composed of pulse-width modulation (PWM) and digital circuitry. Fieres et al. [37] combined multiple mixed FPGA-VLSI models to implement the model first simulated on a computer by Fukushima et al. [21]. Farabet et al. [38] propose a different implementation of a CNN on an FPGA. In this implementation, all basic operations of a CNN are implemented at hardware level (e.g., 2D convolution, 2D pooling, etc.) and macro-instructions are provided to execute them in any order. The sequencing of operations is managed at software level by a PowerPC processor (e.g., management of data transfer from/to an external system; storing/retrieving kernels to/from external memory).
Energies 2018, 11, 2460
More recently, CNN accelerators designed with high level synthesis (HLS) approaches have been developed. Zhang et al. [39] show an optimized CNN accelerator designed with HLS by reordering and tiling loops, inserting pragmas and organizing external memory transfers. In [40], CNNs were implemented with HLS by adopting an N-fold approach, which is particularly suitable for devices with strict restrictions on power and resource consumption. Cost-effective acceleration of CNNs on FPGAs at datacenter scale has been studied by Ovtcharov [41].
CNNs have been widely adopted in many applications, using either a Graphics Processing Unit (GPU) or dedicated hardware [40][41][42]. In contrast to V-I shapes, and the disaggregation problem overall, which have received a great amount of attention from the research community in the past few years, the use of CNNs as appliance identifiers, particularly in FPGA implementations, is very scarce in the NILM domain.

Dataset
The Plug Load Appliance Identification Dataset (PLAID) [16] is used in this work. It actually contains two datasets: PLAID 1 has 1074 samples and PLAID 2 has 719 samples, for a total of 1793 samples. PLAID 2 was released recently (in 2017), and most of the literature using PLAID refers to PLAID 1. PLAID 1 was obtained from 55 houses and PLAID 2 from nine houses. Both datasets were combined for training the deep neural network, resulting in a total of 64 houses. The dataset, collected in Pittsburgh (PA, USA), consists of current and voltage values measured from 11 different electrical appliances (compact fluorescent lamp (CFL), fridge, hairdryer, microwave, air conditioner (AC), laptop, vacuum, fan, washing machine (WM), incandescent light bulb (ILB) and heater) in more than 60 houses. The sampling frequency is 30 kHz. The system was tested using the leave-one-house-out method. Sampling is done without replacement.

Data Pre-Processing
A voltage and current (V-I) trajectory was mapped onto a 2D plot, producing a distinguishable signature or pattern for each appliance [7,8]. This plot can be used to classify different appliances by using visual features of the picture, such as its shape. Voltage (v) and current (i) were acquired over time at a sampling frequency $f_s$ and grid frequency $f_g$. Once a complete period has been acquired, the resulting wave is described by Equation (6) [17].
where d is the number of samples per period ($f_s / f_g$) and p is a point in time. When more than one period of $\phi_p$ is collected, the length of the resulting wave is called the window size. In this work, three different window sizes were analyzed: 83 ms (5 periods), 166 ms (10 periods) and 333 ms (20 periods). Figure 2 shows the graphical representation of the V-I trajectory for each appliance and window size. Visualizing the individual signature requires that the data be summarized and sampled in a way that matches the number of points plotted on the screen. Although each appliance presents its individual signature [8], different brands or models could present signatures with different magnitudes due to their different power consumption [7]. For this reason, the getframe function from Matlab 2018 was used, with the axis limits set to the maximum and minimum of the collected periods. Lastly, the imresize function from Matlab 2018 was used to reduce the images to 50 × 50 pixels.
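As a rough illustration of this pre-processing step, the sketch below rasterizes a V-I trajectory directly onto a 50 × 50 binary grid with NumPy. This is a swapped-in technique, not the Matlab getframe/imresize pipeline used in the paper; the function names and the synthetic sinusoidal waveforms are assumptions, and only the sampling frequency (30 kHz), the grid frequency (60 Hz) and the window sizes come from the text.

```python
import numpy as np

FS_HZ = 30_000   # sampling frequency of PLAID
GRID_HZ = 60     # grid frequency in the USA

def samples_per_window(n_periods, fs=FS_HZ, fg=GRID_HZ):
    """Number of samples in a window of n_periods grid periods."""
    return n_periods * fs // fg

def _to_pixels(x, size):
    """Map values to integer pixel indices 0..size-1 using min/max scaling."""
    span = x.max() - x.min()
    if span == 0:
        return np.full(x.shape, size // 2, dtype=int)
    scaled = (x - x.min()) / span * size
    return np.minimum(scaled.astype(int), size - 1)

def vi_to_image(v, i, size=50):
    """Rasterize a V-I trajectory into a size x size binary image.

    Axis limits are the minimum and maximum of the window, so appliances
    with different magnitudes fill the same image area.
    """
    v = np.asarray(v, dtype=float)
    i = np.asarray(i, dtype=float)
    cols = _to_pixels(v, size)             # voltage on the horizontal axis
    rows = size - 1 - _to_pixels(i, size)  # current on the vertical axis, origin at bottom
    img = np.zeros((size, size), dtype=np.uint8)
    img[rows, cols] = 1
    return img

# Synthetic example: 20 periods of a lagging sinusoidal load (hypothetical values).
n = samples_per_window(20)
t = np.arange(n) / FS_HZ
v = 170.0 * np.sin(2 * np.pi * GRID_HZ * t)
i = 2.0 * np.sin(2 * np.pi * GRID_HZ * t - 0.5)
img = vi_to_image(v, i)
print(n, img.shape)  # 10000 (50, 50)
```

For the three window sizes analyzed, `samples_per_window` gives 2500, 5000 and 10,000 samples, respectively.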

CNN for Appliance Classification
The input of the network is a 50 × 50 matrix, which is treated as a single-channel image. The first layer, C1, performs four convolutions with 3 × 3 kernels on the input, producing four feature maps of size 48 × 48. The second layer, P1, executes 2 × 2 spatial pooling of each feature map. The third layer, C2, performs 3 × 3 convolutions to calculate high-level features. Layer P2 performs 2 × 2 pooling similar to P1. The C3 layer performs 3 × 3 convolutions. Finally, F1 is a linear classifier with 11 neurons that uses the softmax function as its activation function. The CNN architecture is summarized in Table 1. The trainable parameters were computed using the Adam algorithm with a learning rate of 0.001 over 200 epochs, which is justifiable by the large data set used [43].
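The spatial size of each feature map can be traced with a few lines of Python. The sketch assumes unpadded convolutions with stride 1 and non-overlapping pooling, which is consistent with the 48 × 48 output stated for C1; the number of kernels in C2 and C3 is given in Table 1 and is not reproduced here.

```python
def conv_out(n, k=3):
    """Output width of an unpadded, stride-1 convolution with a k x k kernel."""
    return n - k + 1

def pool_out(n, p=2):
    """Output width of non-overlapping p x p pooling."""
    return n // p

size = 50
trace = [("input", size)]
for name, op in [("C1", conv_out), ("P1", pool_out),
                 ("C2", conv_out), ("P2", pool_out), ("C3", conv_out)]:
    size = op(size)
    trace.append((name, size))

for name, s in trace:
    print(f"{name}: {s} x {s}")
# input: 50 x 50, C1: 48 x 48, P1: 24 x 24, C2: 22 x 22, P2: 11 x 11, C3: 9 x 9
```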


CNN Implementation on FPGA
The CNN described in the previous section (Table 1) was modeled in C code. Through a Register Transfer Level (RTL) design, the solution was optimized and exported as an intellectual property (IP) core, which was then implemented on an FPGA. FPGAs use hardware for configurable programmable logic (PL), which makes them very fast and flexible processing devices. Figure 3 depicts the proposed CNN architecture implemented on the IP core.
In the first part, the design computes the respective operating layer, i.e., the convolution (C), pooling (P) or full (F) operation. The kernel (K), bias (b) and weight (W) values are read from memory to execute the convolution and full operations. Loop unrolling was also applied to exploit parallelism between loop iterations: it creates multiple copies of the loop body. More parallelism means higher system performance and, consequently, more throughput. The objective of the optimization is to enable efficient loop unrolling that fully uses the resources provided by the FPGA. The 28 nm Artix-7 fabric [44] adopted in this work has enough resources to parallelize the multiplication/accumulation in the convolution (Figure 4a) and the filter pooling (Figure 4b).
In the second part of the CNN design, shown in Figure 3, the activation function (ReLU or softmax) is computed. The ReLU is computed as f(x) = 0 for x ≤ 0 and f(x) = x otherwise.

The implementation of the softmax function on an FPGA is itself a challenging task. A hybrid solution was implemented on hardware to compute the exponential function. It consists of decomposing the exponent into an integer part and a fractional part (Equation (7)).
$$e^{x} = e^{\mathrm{int}(x)} \cdot e^{\mathrm{frac}(x)} \qquad (7)$$
where frac(x) represents the fractional part of x and int(x) is the integer part of x. The $e^{\mathrm{frac}(x)}$ values are calculated through a polynomial interpolator. Chebyshev interpolation, whose interpolation nodes are more concentrated towards the ends of the interval than those of classic techniques, was adopted [45,46]. The $e^{\mathrm{int}(x)}$ values are stored in a ROM [47]; in this implementation, values from $e^{-30}$ to $e^{30}$ are stored. Lastly, the design implemented on hardware was exported as an IP core and its integration into a functional FPGA design was completed using the Vivado Design Suite software (Xilinx, San Jose, CA, USA).
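The hybrid exponential can be mimicked in software as follows. This is a floating-point sketch under stated assumptions, not the hardware design: the polynomial degree (5), the use of floor for int(x) so that frac(x) lies in [0, 1), and the saturation behaviour outside the table are all choices made here for illustration. Only the decomposition, the Chebyshev interpolation and the table range from $e^{-30}$ to $e^{30}$ come from the text.

```python
import numpy as np
from numpy.polynomial import Chebyshev

INT_MIN, INT_MAX = -30, 30

# "ROM": precomputed e^n for every integer n in [-30, 30].
EXP_INT_ROM = np.exp(np.arange(INT_MIN, INT_MAX + 1, dtype=np.float64))

# Chebyshev interpolant of e^x on [0, 1); degree 5 is an assumed value.
EXP_FRAC_POLY = Chebyshev.interpolate(np.exp, deg=5, domain=[0.0, 1.0])

def hybrid_exp(x):
    """Approximate e^x as e^int(x) * e^frac(x), with int(x) = floor(x)."""
    x = np.asarray(x, dtype=np.float64)
    n = np.floor(x)
    frac = x - n                                        # always in [0, 1)
    # Inputs outside the table range saturate at the nearest stored integer.
    idx = np.clip(n, INT_MIN, INT_MAX).astype(int) - INT_MIN
    return EXP_INT_ROM[idx] * EXP_FRAC_POLY(frac)

def softmax(logits):
    """Softmax built on the hybrid exponential."""
    e = hybrid_exp(logits)
    return e / e.sum()

xs = np.array([-2.3, 0.0, 1.75, 10.5])
rel_err = np.abs(hybrid_exp(xs) - np.exp(xs)) / np.exp(xs)
print(rel_err.max() < 1e-5)  # True
```

Using floor rather than truncation keeps the fractional part non-negative for negative inputs, so a single interpolant on [0, 1) is enough.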

Evaluation Metrics
The F-score is used to measure the classification performance so that this classifier can be compared with the previously cited publications. The F-score balances precision and sensitivity for each appliance. For each appliance in each house, the F-score can be expressed as:
$$F_m = \frac{2\,TP_m}{2\,TP_m + FP_m + FN_m}$$
where m denotes the m-th house, $TP_m$ is the number of true positives (i.e., the number of appliances correctly labeled as belonging to the positive class), $FP_m$ is the number of false positives (i.e., the number of appliances erroneously labeled as belonging to the class) and $FN_m$ is the number of false negatives (i.e., the number of appliances that were not labeled as belonging to the positive class but should have been). Lastly, the average of all the F-measures is calculated using Equation (11).
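A short sketch of the metric follows. The per-house counts are hypothetical, and returning 0 when an appliance has no true positives, false positives or false negatives in a house is an assumption of this example, since the text does not say how that case is handled.

```python
def f_score(tp, fp, fn):
    """F-score from true positives, false positives and false negatives.

    Equivalent to the harmonic mean of precision and sensitivity.
    Returns 0.0 when the denominator is zero (assumed convention).
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mean_f_score(counts):
    """Unweighted average of the F-scores over a list of (tp, fp, fn) tuples."""
    scores = [f_score(tp, fp, fn) for tp, fp, fn in counts]
    return sum(scores) / len(scores)

# Hypothetical (tp, fp, fn) counts for one appliance in three held-out houses.
houses = [(8, 1, 1), (5, 0, 5), (0, 2, 3)]
print([round(f_score(*h), 4) for h in houses])  # [0.8889, 0.6667, 0.0]
print(round(mean_f_score(houses), 4))           # 0.5185
```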

Power and Temperature Effects on the FPGA
In the FPGA implementation, the total on-chip power is the power consumed inside the FPGA. The thermal power is given by Equation (12) [48,49].
$$P = P_{dynamic} + P_{static} \qquad (12)$$
The dynamic power depends on the activity and capacitance of the circuit, while the static power depends on the process properties, manufacturing, the device junction temperature and the applied voltage. The device junction temperature is the operating temperature of the device (Equation (13)) [48,49].
$$T_J = T_A + \theta_{JA} \cdot P \qquad (13)$$
where $T_A$ is the temperature of the environment and $\theta_{JA}$ is the effective thermal resistance, which describes how much power is dissipated from the FPGA silicon to the environment. The maximum junction temperature supported by the Zynq 7000 device (Xilinx, San Jose, CA, USA) is 85 °C [50].
Figure 9 shows the overall F-score of the PLAID 1 and PLAID 2 datasets. The mean F-score of the combined dataset was computed as a weighted mean because PLAID 1 and PLAID 2 do not have the same number of houses; each dataset is weighted according to its number of houses. The mean F-score shows that, as the window size increases, the F-score of appliance classification for each house improves and, consequently, the F-score of the overall classification also improves. The highest total F-score was achieved using a 333 ms window (20 periods); hence a 333 ms window is adopted in this work.
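The house-weighted mean used for the combined dataset can be sketched as below. The house counts (55 and 9) come from the dataset description; the two F-score values are hypothetical placeholders, not results from the paper.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean of values with the given non-negative weights."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

HOUSES_PLAID1 = 55
HOUSES_PLAID2 = 9

# Hypothetical per-dataset mean F-scores, for illustration only.
f_plaid1, f_plaid2 = 0.80, 0.70

combined = weighted_mean([f_plaid1, f_plaid2], [HOUSES_PLAID1, HOUSES_PLAID2])
print(round(combined, 4))  # 0.7859
```

Because PLAID 1 contributes 55 of the 64 houses, the combined score stays close to the PLAID 1 value.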


Performance and Cost
Parallelism was implemented in all multiplications/accumulations in the convolution layers and in all filters in the pooling layers. Table 2 presents the required resources, performance and power. Analyzing Table 2, the design mostly uses LUTs and BRAM to store the kernel and weight values and to temporarily save the results of each layer. The DSPs are used to compute the basic operations of the CNN. The remaining resources are used to implement counters and the controller, which is a state machine. The latency of this design is approximately 5.7 ms, which is lower than the window size (333 ms). This implies not only that the classification is finished before the next window has been acquired, but also that it is possible to run overlapping windows with a 5.7 ms slide. To complement these results, the performance on a CPU and on a GPU was also measured. The CPU is an Intel i7 (4 cores) with a clock speed of 4.20 GHz and 32 GB of RAM. The GPU is an Nvidia GeForce GTX 1050 with 768 Compute Unified Device Architecture (CUDA) cores. The processing times were 69 ms on the CPU and 87 ms on the GPU. The GPU result is attributed to the start-up overhead of CUDA, which is significant for small CNNs like this one. More concretely, although the GPU-bound code runs faster than the CPU code, the cost of transferring the data to and from the GPU outweighs any gains from using the GPU. Consequently, since the transfer of data between CPU and GPU is a requirement of our application, the GPU does not provide significant gains. Table 3 summarizes our F-score values using the leave-one-house-out method and compares them with recently published results on the same dataset, where available.
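The timing figures above can be put in perspective with a short calculation. The three latencies and the window length are taken from the text; the derived quantities (speed-up factors and the number of classifications that fit in one window) are simple arithmetic added for illustration and are not reported in the paper.

```python
FPGA_MS = 5.7     # latency of the FPGA implementation
CPU_MS = 69.0     # measured on the Intel i7 CPU
GPU_MS = 87.0     # measured on the Nvidia GPU
WINDOW_MS = 333.0 # 20 grid periods at 60 Hz

speedup_vs_cpu = CPU_MS / FPGA_MS
speedup_vs_gpu = GPU_MS / FPGA_MS

# How many back-to-back classifications fit inside one acquisition window.
classifications_per_window = int(WINDOW_MS // FPGA_MS)

print(round(speedup_vs_cpu, 1))    # 12.1
print(round(speedup_vs_gpu, 1))    # 15.3
print(classifications_per_window)  # 58
```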

Comparison Results
The obtained F-score value of the proposed system is compared to those of the ensembles classifiers [17] and the CNN classifier [9]. Comparing the proposed system with these two recent works, the proposed system presents a slightly higher F-score than the other CNN work. The proposed system has lower F-score value than the ensemble methods [17]. The model presented in [17] has an ensemble of 55 feed-forward neural network with 30 hidden neurons. If all hidden neurons are combined, the model would have 1650 neurons. Comparing this model with our model, our model is small. However, the proposed classifier presents a difference of 8% on the total F-score when compared with the model using a Neural Network Ensembles. This difference comes because the neural network ensemble classifies the appliances using the raw current and voltage waveforms as inputs instead of V-I trajectory. In the other hand, this difference comes mostly from the low F-score value of AC and fan. From our understanding, this misclassification is due to the similarities of V-I trajectories of these two appliances. This similarity comes of the fan which is coupled inside of an AC. In addition, the proposed model do not need any extra feature extraction method which also reduce the size of the implemented system. Furthermore, different from what has been developed in this work, the neural network ensemble classifies the appliances using the raw current and voltage waveforms as inputs instead of V-I trajectory. Figures 10 and 11 present box and whisker plots showing the median of F-score, the outliers and the 25th and 75th percentiles. Analyzing these figures, the appliances CFL and vacuum present the best F-score. On the other hand, the fan presents the worst F-score. Still, looking at Figures 10 and 11, the result has fluctuations. The reason for that could be due to some houses do not have all appliances. In fact, the worst case is the fan where the F-score values fluctuate from 0% to 100%. 
This fluctuation contributes to decreasing the overall score. The outliers of the remaining appliances also have a negative effect on the overall performance of the system.
Comparing the developed local identification method with the on-line identification method, the proposed method is advantageous. The disadvantage of the on-line identification method is that the raw data need to be transferred over the internet to the appliance classifier, which is located in the cloud. Conversely, if the proposed method is used, only the answer of the system is sent over the internet.
Thus, the compression ratio, $r$, is defined by:
$$ r = \frac{\text{Transmission using raw data}}{\text{Transmission using the proposed method}} = \frac{f_s \times \text{window size period} \times \frac{1}{f_g}}{\text{Transmission using the proposed method}} \tag{15} $$
Now, consider a sampling frequency, $f_s$, of 30 kHz, a window size period of 20 grid cycles used to develop the classifier, and a grid frequency, $f_g$, of 60 Hz in the USA (50 Hz in the EU). If the on-line identification method is used, 10,000 samples need to be sent over the internet. With the proposed method, only the answer of the system is sent, so the communication bandwidth requirement is expected to fall by about four orders of magnitude.
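The arithmetic behind the compression ratio can be checked with a few lines of Python. This is only a minimal sketch under stated assumptions: the function names and the `values_sent_locally` parameter are hypothetical, the local method is assumed to send a single value (the class label) per window, and samples are counted per window as in the text rather than converted to bytes.

```python
import math


def samples_per_window(fs_hz: float, n_cycles: int, grid_hz: float) -> int:
    """Raw samples in one classification window: f_s * window size period * (1 / f_g)."""
    return round(fs_hz * n_cycles / grid_hz)


def compression_ratio(fs_hz: float, n_cycles: int, grid_hz: float,
                      values_sent_locally: int = 1) -> float:
    """Ratio r of raw-data transmission to local-classifier transmission.

    values_sent_locally is an assumption: one value (the class label) per window.
    """
    return samples_per_window(fs_hz, n_cycles, grid_hz) / values_sent_locally


if __name__ == "__main__":
    # 30 kHz sampling and a 20-cycle window, for a 60 Hz and a 50 Hz grid.
    for grid_hz in (60.0, 50.0):
        n = samples_per_window(30_000.0, 20, grid_hz)
        r = compression_ratio(30_000.0, 20, grid_hz)
        print(f"{grid_hz:.0f} Hz grid: {n} samples per window, "
              f"r = {r:.0f} (about {math.log10(r):.1f} orders of magnitude)")
```

For the 60 Hz case this reproduces the 10,000 samples quoted in the text. If the local method sent more than one value per window (for example a label plus a confidence score), the ratio would shrink proportionally.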

Conclusion and Future Work Directions
The present work proposes a strategy to identify appliances by analyzing changes in the voltage and current and deducing which appliances are used in the house as well as their individual energy consumption. A CNN, which can automatically extract relevant spatial features from the V-I trajectories, was proposed as the classifier for identifying appliances. The CNN was applied to the PLAID 1 and PLAID 2 datasets, resulting in F-scores of 78.16% and 66.01%, respectively. The proposed system presents a slightly higher F-score than the CNN work found in the literature. However, the proposed classifier shows a difference of 8% in the total F-score when compared with the model using neural network ensembles. This difference arises because the neural network ensemble classifies the appliances using the raw current and voltage waveforms as inputs instead of the V-I trajectory.
Also, in this paper, the CNN was implemented on hardware. The idea behind this implementation was to obtain the classifier's response before 333 ms (the window size extension) had elapsed. For this implementation, a Zynq board was used. The main advantage of this implementation is that the CNN, which is the most computationally intensive part, is executed on an FPGA. Parallelism was exploited extensively to speed up processing.

In the future, data collection methods, as well as data transfer into the board, are going to be implemented in the processing subsystem. This is expected to reduce the communication bandwidth requirement by about four orders of magnitude.
Likewise, considering the recent promise of GPUs on SoCs to rapidly deploy trained deep neural networks and accelerate their inference step, additional work should be conducted to understand whether such systems can become a reliable alternative to FPGAs in this or similar application scenarios.