Calculation Method of Theoretical Line Loss in Low-Voltage Grids Based on Improved Random Forest Algorithm

Abstract: The theoretical line loss rate is the basic reference value for line loss management in low-voltage grids, but it is difficult to calculate accurately because line impedance and measurement parameters are often incomplete or abnormal. Traditional algorithms discard problematic samples, which greatly reduces the number of samples available for model training and restricts the accuracy of the trained model. Therefore, an improved random forest method is proposed to calculate and analyze the theoretical line loss of low-voltage grids. Based on an analysis of the influence mechanism and the data samples, an electrical characteristic indicator system for the theoretical line loss is constructed, and the concept of power supply torque is proposed for the first time. On this basis, the attribute division process of the decision tree model is optimized, which relaxes the high requirement of random forest on the integrity of feature data. Finally, the improvement achieved by the proposed method is verified on 23,754 low-voltage grids, and the method shows better accuracy when a large number of samples are missing.


Introduction
According to the latest analysis report on China's electric power development, the line loss rate of the power system is gradually decreasing, from 5.62% in 2020 to 5.26% in 2021, owing to technological advancement. However, there is still a relatively large gap from developed countries, whose average line loss rate is 4%. Considering that the line loss from low-voltage grids constitutes about 40% of that of the overall power system, it is of great practical significance to study how to manage power systems scientifically and save energy by accurately and effectively identifying the energy-loss reduction potential of low-voltage grids.
The theoretical line loss in low-voltage grids is determined by the technical conditions of the power grid equipment. It includes the loss of overhead and cable lines; the loss of capacitors, reactors, cameras, and other auxiliary equipment; and the loss of voltage and current transducers, energy meters, etc. Compared with human management factors, the theoretical line loss is relatively stable, and the energy-loss reduction potential can be detected from the difference between the theoretical and the actual line loss. Because of complex wiring and large differences in line length and consumption load among low-voltage grids, traditional methods, including the equivalent resistance method and the power flow algorithm, cannot be applied satisfactorily, and improvements are required. For example, [1] improved the calculation of the load curve shape coefficient, copper loss, small (many) power supply, and branch power, whilst [2] introduced the average time of current loss and weakened the hypothesis conditions to improve the accuracy. A network loss calculation method based on user meter power has been proposed, but it is highly dependent on accurate information such as network structure, line type, and

• The reasonable characteristic factors of the low-voltage grids are constructed according to their physical and operational characteristics. The concept of power supply torque is proposed for the first time.

• The random forest algorithm is improved by modifying the property classifying process of the decision tree and optimizing the weight factor allocation method when data are missing. The problem of the high requirement on characteristic data integrity is solved, and the accuracy of the model is improved when a large number of samples are missing.

Table 1 shows the comparison of the proposed method with existing methods. The detailed algorithms are given in the following sections.

Table 1. Comparison of the proposed method with existing methods (last column: Accuracy of the Model).
Ref. [5] | × | × | √ | High | Low | High
Ref. [14] | × | √ | × | High | High | High
Refs. [16,17] | √ | √ | × | Low | Moderate | Low
Ref. [19] | √ | √ | √ | Moderate | Moderate | Moderate
Refs. [20,21] | √ | √ | × | Moderate | High | Moderate
Proposed method | √ | √ | √ | Low | Low | High

The remainder of this paper is organized as follows. The reasonable characteristic factors of the low-voltage grids are constructed in Section 2. Section 3 presents the algorithm for calculating the theoretical line loss with random forest and improves the adaptability of the model by modifying the property classifying process of the decision tree. Section 4 demonstrates the test results that validate the proposed method. Section 5 concludes the paper.

Influence Mechanism of Theoretical Line Loss
To improve the performance of model training, characteristic factors should be constructed according to the influence mechanism of theoretical line loss in low-voltage grids. In this paper, the equivalent resistance method, which has relatively high calculation accuracy, is selected to study and analyze this influence mechanism. Its basic principle is shown in Figure 1. In the simplified circuit model, the grid is represented by an equivalent resistance $R_{eql}$, such that the energy loss generated when the total line current $I_{av}$ flows through this resistance equals the sum of the losses generated in the resistance $R_i$ ($i = 1, 2, \ldots, n$) of each branch line.
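The equivalence stated above can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions that are not taken from the paper: each branch carries a constant current, the loss of a branch is taken as $I_i^2 R_i$, and the function name and the numerical values are hypothetical.

```python
def equivalent_resistance(branch_currents, branch_resistances, total_current):
    """Return R_eql such that total_current**2 * R_eql equals the summed branch losses.

    Assumption (illustrative only): each branch i carries a constant current I_i
    through a resistance R_i, so its power loss is I_i**2 * R_i.
    """
    if total_current <= 0:
        raise ValueError("total_current must be positive")
    if len(branch_currents) != len(branch_resistances):
        raise ValueError("one resistance is needed per branch current")
    branch_loss = sum(i * i * r for i, r in zip(branch_currents, branch_resistances))
    return branch_loss / (total_current ** 2)


# Hypothetical feeder with three branches (amperes and ohms).
currents = [30.0, 20.0, 10.0]
resistances = [0.05, 0.08, 0.12]
r_eql = equivalent_resistance(currents, resistances, total_current=60.0)

# The loss computed through R_eql matches the sum of the branch losses.
loss_via_equivalent = 60.0 ** 2 * r_eql
loss_via_branches = sum(i * i * r for i, r in zip(currents, resistances))
```

With the equivalent resistance in hand, the loss of the whole grid can be evaluated from the single measured current at the transformer, which is why the method suits grids whose branch currents are not all metered.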
According to the equivalent resistance method, the theoretical line loss in the low-voltage grids is calculated by Equation (1), where $\Delta A$ (kWh) is the energy loss in a low-voltage grid, $N$ is the structure parameter of the grid, which varies according to the wiring mode, $k$ is the load shape factor, $I_{av}$ is the average current at the secondary side of the transformer, $R_{eql}$ (Ω) is the equivalent resistance of the low-voltage grid, $K_b$ is the three-phase unbalance coefficient, $t$ (h) is the time of operation, $D$ is the number of annual calendar days, $m_i$ is the number of electric meters, $\Delta A_{dbi}$ (kWh) is the monthly energy loss of the electric meters, and $\Delta A_C$ (kWh) is the energy loss of the reactive-load compensation equipment.
As can be seen from Equation (1), the theoretical line loss in the low-voltage grids is composed of the loss of the line, the loss of the electric meters, and the loss of the reactive power compensation equipment. The electric meter loss and the reactive power compensation equipment loss are relatively stable, so the line loss in the station area plays the dominant role in the theoretical line loss of the grids. The main factors affecting the line loss of the station area can be subdivided into two categories: static line factors and dynamic operation factors.

Qualitative Influence Analysis of the Factors
(1) Static line factors
Static line factors mainly include line length and line type, which affect the equivalent resistance $R_{eql}$ of the low-voltage grids. When the line becomes longer or the electrical resistivity of the selected line increases, $R_{eql}$ increases and the line loss increases accordingly. Figure 2 shows a survey of the relationship between line length and line loss in a certain area; there is a relatively positive correlation between the two parameters.

(2) Dynamic operation factors
Dynamic operation factors mainly include the level of electric load, the fluctuation characteristics of electric load, and the degree of three-phase unbalance. As any of these increases, the line loss increases correspondingly. Figure 3 shows a survey of the relationship between the average current at the secondary side of the transformer and line loss in a certain area; there is obviously a positive correlation between the two parameters.

(3) Fusion factors
The static line factors are relatively separate from the dynamic operation factors. To characterize and analyze the influence mechanism more accurately, the concept of power supply torque, shown in Figure 4, is proposed for the first time in this paper by coupling the line factors and operation factors in the low-voltage grids and referring to the concept of torque in mechanics. It is defined in Equation (2) as the average product of the power supply distance and the average daily electricity consumption of all users in the low-voltage grids:

$$M_e = \frac{1}{n}\sum_{i=1}^{n} P_i D_i \qquad (2)$$

where $M_e$ is the supply torque, $P_i$ is the power of consumer $i$, $n$ is the number of consumers in one low-voltage grid, and $D_i$ is the distance between consumer $i$ and the transformer of the grid.
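As a minimal sketch of the power supply torque, the following Python function takes the mean of the per-consumer products of power and distance, following the wording "average product" used for Equation (2). The function name, the units, and the sample values are assumptions made for illustration.

```python
def power_supply_torque(powers, distances):
    """Mean of P_i * D_i over all consumers of one low-voltage grid.

    powers:    power of each consumer (units are an assumption, e.g. kW)
    distances: distance from each consumer to the transformer (e.g. m)
    """
    if not powers or len(powers) != len(distances):
        raise ValueError("powers and distances must be non-empty and equally long")
    return sum(p * d for p, d in zip(powers, distances)) / len(powers)


# Hypothetical grid with three consumers.
m_e = power_supply_torque([2.0, 4.0, 6.0], [100.0, 200.0, 300.0])
```

A heavy load far from the transformer raises the torque more than the same load close by, which is the coupling between line factors and operation factors that the indicator is meant to capture.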


Construction of Characteristic Factors
Considering all the factors affecting the theoretical line loss of low-voltage grids discussed in Section 2.2, the following eight characteristic factors are selected in this paper: power supply radius $X_1$, total line length of the low-voltage grid $X_2$, number of consumers in the low-voltage grid $X_3$, load rate of the low-voltage grid $X_4$, three-phase unbalance degree $X_5$, load shape factor $X_6$, power factor $X_7$, and power supply torque $X_8$. Detailed definitions of these factors are listed in Table 2.

Table 2. Characteristic factors for theoretical line loss in low-voltage grids.

Characteristic Factor | Characteristic Definition
Power supply radius $X_1$ | Physical distance from the furthest load point to the distribution transformer
Total line length $X_2$ | Sum of the total low-voltage line length in the low-voltage grid
User numbers $X_3$ | Total number of users in the low-voltage grid, including single-phase users and three-phase users
Load rate $X_4$ | Ratio of power consumption capacity to rated capacity of the distribution transformer
Three-phase unbalance degree $X_5$ | Unbalance degree of the three-phase current in the three-phase power system, which is the relative deviation between the maximum and the mean value of the three-phase current in a transformer
Load shape factor $X_6$ | Ratio of the daily current RMS value to the average value on the distribution transformer side of the low-voltage grid
Transformer daily average power factor $X_7$ | Ratio of active power to apparent power in the low-voltage grid
Power supply torque $X_8$ | Multiplication of the average power supply distance of the low-voltage grid load and the user average power consumption capacity
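Several of the definitions in Table 2 are simple ratios, and the sketch below shows how they might be computed from transformer-side measurements. The function names, argument units, and the choice to express the unbalance degree in percent are assumptions for illustration, not details confirmed by the paper.

```python
import math


def load_rate(load_kva, rated_kva):
    """X4: ratio of power consumption capacity to the transformer's rated capacity."""
    return load_kva / rated_kva


def three_phase_unbalance(i_a, i_b, i_c):
    """X5: relative deviation (in percent, by assumption) of the maximum phase
    current from the mean of the three phase currents."""
    mean = (i_a + i_b + i_c) / 3.0
    if mean == 0:
        raise ValueError("mean phase current is zero")
    return (max(i_a, i_b, i_c) - mean) / mean * 100.0


def load_shape_factor(currents):
    """X6: ratio of the RMS value to the average value of the daily current samples."""
    mean = sum(currents) / len(currents)
    rms = math.sqrt(sum(c * c for c in currents) / len(currents))
    return rms / mean


def power_factor(active_kw, apparent_kva):
    """X7: ratio of active power to apparent power."""
    return active_kw / apparent_kva


balanced = three_phase_unbalance(10.0, 10.0, 10.0)    # no deviation from the mean
one_phase_only = three_phase_unbalance(30.0, 0.0, 0.0)  # all current on one phase
flat_profile = load_shape_factor([10.0, 10.0, 10.0])
```

Under this percent convention the unbalance degree cannot exceed 200, which is reached when all current flows in a single phase, and a load shape factor is never below 1 for non-negative currents. Bounds of this kind are useful when screening calculated factors for abnormal values.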

Model Based on Traditional Random Forest
Random forest [23–27] is an ensemble learning algorithm based on decision trees. As shown in Figure 5, multiple sample sets are extracted from the original sample set by the Bootstrap sampling method. A decision tree model is established for each sample set, and the predictions from the multiple decision trees are combined by a voting mechanism to obtain the final result. Instead of setting aside an additional portion of the data as a test set for model evaluation, the Out-of-Bag (OOB) data can be used for evaluation, while the In-Bag data are used for model training.

Assume that the original dataset of theoretical line loss from low-voltage grids and the corresponding characteristic factors is $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, that the electrical characteristic index set is $x = \{X_1, X_2, \ldots, X_n\}$, and that the theoretical line loss set is $y$. The sample quantity is $m$, the sample property dimension is $n$, and the number of individual decision tree learners is $k$. The random forest algorithm is implemented as follows:

(1) Extract $m$ data samples randomly, with replacement, from the original dataset and repeat the process $k$ times to obtain $k$ training datasets.

(2) Input the corresponding dataset into each decision tree and select the classifying property for each node of the decision tree. Randomly select a subset of $d$ properties from the $n$ data properties, and then choose the optimum classifying property from this subset. Normally, $d$ equals the integer closest to $\log_2 n$. Because the decision tree is the individual learner in the random forest algorithm, the learning capability of the random forest depends on the performance of the decision tree. For an input dataset $S$ and property set $A$, the decision tree is built as follows:
(a) If the label values of all the data in $S$ are the same, a decision tree with only one node is generated, and the node value is that label value.
(b) If $A$ is empty or all the data in $S$ have the same values in $A$, a decision tree with only one node is generated, and the node value is the label value of the majority of the data samples in $S$.
(c) Traverse all the values of the selected property $A_i$, and for each value $v$ form the dataset $S_v$ of all the data that take this value in $A_i$.
(d) If $S_v$ is empty, mark $S_v$ as a node whose value is the label value of the majority of the data samples in $S$.
(e) If $S_v$ is not empty, treat $S_v$ as the input dataset and $A \setminus \{A_i\}$ as the property set. Repeat steps (a)~(e) until a decision tree is generated.

(3) For regression, an averaging strategy is applied: the output values of all the decision trees are averaged to give the final output value. For classification, a voting strategy is applied: the classified outputs of all the decision trees are compared, and the one with the most votes is taken as the final output value.

(4) Based on the historical documentation and measurement data of the low-voltage grids from the consumption data collection system, marketing system, production management system, and geographic information system, the characteristic factor data are calculated for each low-voltage grid according to the definition and calculation principle of each characteristic factor. Abnormal characteristic data are then cleaned. The cleaned sample data are fed into the random forest algorithm for training to establish a theoretical line loss model of low-voltage grids, and the theoretical line loss is finally calculated with the established model. A detailed algorithm flow chart is given in Figure 6.
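The sampling and aggregation steps (1) to (3) can be sketched in a few lines of Python. This is a simplified illustration using only the standard library: the tree-growing step is omitted, and all function names and example values are hypothetical.

```python
import math
import random


def bootstrap_indices(m, rng):
    """Step (1): draw m indices with replacement; unused indices form the OOB set."""
    in_bag = [rng.randrange(m) for _ in range(m)]
    chosen = set(in_bag)
    out_of_bag = [j for j in range(m) if j not in chosen]
    return in_bag, out_of_bag


def feature_subset(n, rng):
    """Step (2): pick d of the n properties at random, d = integer closest to log2(n)."""
    d = max(1, round(math.log2(n)))
    return sorted(rng.sample(range(n), d))


def aggregate_regression(tree_outputs):
    """Step (3), regression: average the outputs of the individual trees."""
    return sum(tree_outputs) / len(tree_outputs)


def aggregate_classification(tree_outputs):
    """Step (3), classification: return the label with the most votes."""
    return max(set(tree_outputs), key=tree_outputs.count)


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(m=10, rng=rng)
features = feature_subset(n=8, rng=rng)  # eight characteristic factors give d = 3
prediction = aggregate_regression([3.1, 2.9, 3.3])
vote = aggregate_classification(["high", "low", "high"])
```

Because each tree sees only its In-Bag rows, the Out-of-Bag rows of that tree act as unseen data, which is what allows the model to be evaluated without a separately reserved test set.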

Improved Random Forest Algorithm
ID3, C4.5, and CART are the most commonly used decision tree algorithms at present. ID3 is prone to overfitting and C4.5 is relatively complex, which makes it difficult to train models with a large number of samples. Therefore, the CART decision tree is selected based on the calculation principle and calculation requirements of theoretical line loss. Equations (3) and (4) are used to select the optimum classifying property, where $Gini(S)$ is the Gini index of $S$ and $Gini(S, A_i)$ represents the change of the Gini index before and after $S$ is classified by $A_i$. $C$ is the number of types of data samples in the dataset, and $p_k$ is the ratio of the $k$th type of samples. The CART decision tree selects the $A_i$ that maximizes $Gini(S, A_i)$ as the classifying property. However, when there are missing values in $S$, Equations (3) and (4) cannot be applied directly. If the samples with missing values are forcibly cleaned, massive sample data loss may occur and the model accuracy will suffer.

Therefore, the basic CART decision tree is optimized and modified. Let $S_i$ be the subset of $S$ made up of the evaluation samples without missing values in $A_i$. There are $K$ types of data samples, and $S_i^k$ ($k = 1, 2, \ldots, K$) is the subset of $S_i$ of the $k$th type. $A_i$ takes $V$ values in $S_i$, denoted $A_i^v$ ($v = 1, 2, \ldots, V$). Define a weight $W_j$ ($j = 1, 2, \ldots, m$) for each evaluation data sample $X_j$, let $p_i^k$ be the ratio of the $k$th type of data samples in $S_i$, and let $r_i^v$ be the ratio of the samples in $S_i$ whose value of $A_i$ is $A_i^v$. Based on these definitions, Equations (3) and (4) are rewritten in terms of the weighted ratios. To account for the missing samples, an influence factor of weight $W_i$ is defined, which represents the weight of the samples without missing values in the total samples. In summary, when there are missing values in the evaluation data, the selection equation of the decision tree classifying property is modified accordingly.

With this modification, the equation for optimum classifying property selection can also be used when there are missing values in the evaluation data, and the selection comprehensively considers both the data samples with missing values and those without. The topological structure of the improved random forest with the modified decision tree is shown in Figure 7.
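To make the idea concrete, the sketch below computes a Gini-based split score that tolerates missing property values. It is an assumption-laden illustration, not the paper's exact equations: it uses the standard Gini index $1 - \sum_k p_k^2$, computes the weighted ratios only over samples whose value is present, and scales the result by the weight share of those samples, in the manner commonly used for missing values in decision trees. All names and data are hypothetical.

```python
from collections import defaultdict


def weighted_gini(samples):
    """Gini index 1 - sum(p_k**2), with p_k the weight share of label k.

    Each sample is a tuple (property_value, label, weight).
    """
    total = sum(w for _, _, w in samples)
    if total == 0:
        return 0.0
    weight_by_label = defaultdict(float)
    for _, label, w in samples:
        weight_by_label[label] += w
    return 1.0 - sum((w / total) ** 2 for w in weight_by_label.values())


def gini_gain_with_missing(samples):
    """Split score for one property when some property values are None (missing).

    Assumed form: rho * (Gini(S_i) - sum_v r_v * Gini(S_i^v)), where S_i holds the
    samples with a known value, r_v is the weight share of value v within S_i,
    and rho is the weight share of S_i within all samples.
    """
    known = [s for s in samples if s[0] is not None]
    total_weight = sum(w for _, _, w in samples)
    known_weight = sum(w for _, _, w in known)
    if total_weight == 0 or known_weight == 0:
        return 0.0
    rho = known_weight / total_weight
    groups = defaultdict(list)
    for s in known:
        groups[s[0]].append(s)
    gini_after = sum(
        sum(w for _, _, w in group) / known_weight * weighted_gini(group)
        for group in groups.values()
    )
    return rho * (weighted_gini(known) - gini_after)


# Hypothetical samples: (line-length class, loss class, weight); one value is missing.
data = [
    ("short", "low", 1.0),
    ("short", "low", 1.0),
    ("long", "high", 1.0),
    ("long", "high", 1.0),
    (None, "low", 1.0),
]
gain = gini_gain_with_missing(data)
```

In this example the property separates the four complete samples perfectly, and the score is discounted by the factor 4/5 because one of the five samples carries no value for it. A property with many missing values is therefore penalized rather than forcing its incomplete samples to be discarded.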

Evaluation of the Algorithm
To evaluate the algorithm, the training samples are first fed into the algorithm and the corresponding model parameters are trained, including the number of leaf nodes and the number of decision trees in the random forest. The root mean square error (RMSE) of the test samples is then used to measure the accuracy of the model. RMSE is the square root of the mean of the squared residuals between the calculated values and the observed values, and it indicates the accuracy of the calculated values. It is expressed by Equation (11):

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^c - y_i^o\right)^2} \qquad (11)$$

where $n$ is the number of test samples, $y_i^c$ is the value calculated by the model, and $y_i^o$ is the observed value. The closer the RMSE is to 0, the higher the accuracy of the model.
Meanwhile, the overall calculation can be assessed from the distribution of the calculated values and the observed values of the test samples, by fitting the linear relationship between them and measuring the correlation coefficient between the calculated values and the observed values. Theoretically, the coefficient lies between −1 and 1. The closer the coefficient is to 1, the better the linear relationship between the calculated values and the observed values, the smaller the overall difference between them, and the better the calculation accuracy of the model.
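Both evaluation measures are straightforward to compute, as the following Python sketch shows. The correlation coefficient is taken here to be the Pearson coefficient, which is an assumption since the paper does not name it, and the sample values are invented for illustration.

```python
import math


def rmse(calculated, observed):
    """Root mean square error between calculated and observed values."""
    n = len(calculated)
    return math.sqrt(sum((c - o) ** 2 for c, o in zip(calculated, observed)) / n)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical line loss rates (%): model output versus observation.
calculated = [2.0, 3.0, 4.0]
observed = [2.5, 3.0, 3.5]
error = rmse(calculated, observed)
r = pearson_r(calculated, observed)
```

The two measures are complementary: in this example the values are perfectly linearly related, so the coefficient is 1, yet the RMSE is still nonzero because the calculated values are not equal to the observed ones.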

Data Preparation
Take the line loss calculation of 23,754 low-voltage grids in a specific area as an example. The characteristic factors of each low-voltage grid are calculated once a day, and the derived results are treated as one sample record. In total, 166,283 samples are accumulated over 7 consecutive days. After cleaning, these samples are divided into two parts: the training samples, accounting for 80%, and the test samples, accounting for 20%. The accuracy of the model is evaluated by the RMSE of the test samples.
During abnormal data cleaning, it is found that the calculated characteristic factor data contain many abnormalities because of data collection issues and nonuniform data quality. Furthermore, factors such as power supply radius and low-voltage line length are missing in some low-voltage grids because documentation management levels differ between grids. Therefore, a reasonable range for each characteristic factor is set with consideration of the electrical characteristic calculation principles and the low-voltage grid design regulations, and abnormal sample data outside these ranges are cleaned. The cleaning and screening conditions of each characteristic factor are shown in Table 3. After cleaning, 6708 data samples are retained, which constitute only 4.03% of the original low-voltage grid samples. This is due to the massive sample abnormalities and missing data caused by data management issues in the original sample sets.
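The forced cleaning and the 80/20 split described above might look like the following Python sketch. The factor names and the plausibility ranges are placeholders chosen for illustration; the actual screening conditions are those of Table 3, which are not reproduced here.

```python
import random

# Hypothetical plausibility ranges per characteristic factor (illustrative only).
RANGES = {
    "line_length_km": (0.01, 50.0),
    "power_factor": (0.5, 1.0),
}


def is_clean(sample, ranges):
    """A sample passes forced cleaning only if every factor is present and in range."""
    for name, (low, high) in ranges.items():
        value = sample.get(name)
        if value is None or not (low <= value <= high):
            return False
    return True


def forced_cleaning(samples, ranges):
    """Discard every sample with a missing or out-of-range factor."""
    return [s for s in samples if is_clean(s, ranges)]


def train_test_split(samples, train_fraction=0.8, seed=0):
    """Shuffle reproducibly and split into training and test samples."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


raw = [{"line_length_km": 1.0 + i, "power_factor": 0.9} for i in range(10)]
raw.append({"line_length_km": 0.0, "power_factor": 0.9})   # implausible zero length
raw.append({"line_length_km": 2.0, "power_factor": None})  # missing power factor
cleaned = forced_cleaning(raw, RANGES)
train, test = train_test_split(cleaned)
```

Because a single missing or implausible factor removes the whole record, this strategy can shrink the dataset sharply when many records are incomplete, which is the situation reported above.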

Analysis Based on Traditional Random Forest in High Cleaning Rate
The cleaned sample data are fed as the original data into the random forest algorithm model, and the numbers of leaf nodes and decision trees in the model are selected with RMSE as the evaluation criterion. As shown in Figure 8, as the number of decision trees increases, the RMSE of the model becomes smaller and the decreasing trend gradually slows down, while the complexity of the model gradually increases. Meanwhile, for the same number of decision trees, the RMSE becomes smaller with fewer leaf nodes. Weighing the accuracy of the model against its complexity, the number of leaf nodes is set to 5 and the number of decision trees to 85, giving an optimum RMSE of 1.3239.
Figure 9 shows the distribution of the model calculated values and the observed values of the test samples. A relatively obvious linear correlation between the calculated values and the observed values can be seen, although the correlation coefficient is only 0.4522. The experimental results demonstrate that the prediction accuracy is acceptable. Excellent calculation accuracy is obtained in the high distribution density range of [1, 5], while the calculation accuracy is low in the low distribution density range of [5, 8]. This results from massive samples being removed in the low distribution density range during forced cleaning.

Analysis Based on Traditional Random Forest in Lower Cleaning Rate
Therefore, samples need to be retained as much as possible considering the calcula tion results of various characteristic factors. As in Figure 10, by analyzing the distribution of various characteristic factors, it is found that the line length of 17% low-voltage grids is zero, which obviously is not true. Meanwhile, a power factor of 33% low-voltage grids is zero while load shape factor of 14.75% low-voltage grids is also zero, which are agains practical operation rules of low-voltage grids. A three-phase unbalance degree of 4.5% low-voltage grids is, besides, actually higher than 200, which does not comply with the calculation principle of the three-phase unbalance degree in low-voltage grids.

Analysis Based on Traditional Random Forest in Lower Cleaning Rate
Therefore, samples need to be retained as much as possible considering the calculation results of various characteristic factors. As in Figure 10, by analyzing the distribution of various characteristic factors, it is found that the line length of 17% low-voltage grids is zero, which obviously is not true. Meanwhile, a power factor of 33% low-voltage grids is zero while load shape factor of 14.75% low-voltage grids is also zero, which are against practical operation rules of low-voltage grids. A three-phase unbalance degree of 4.5% low-voltage grids is, besides, actually higher than 200, which does not comply with the calculation principle of the three-phase unbalance degree in low-voltage grids.
Although data missing and abnormality exist in the mass of sample sets, much useful information disappears if a forced cleaning strategy is implemented, and model accuracy is influenced. By removing abnormal samples not complying with characteristic calculation principles and preserving those samples with partial properties, 89,067 of data samples are retained, which constitute 53.56% of the total samples. The cleaned sample data are fed into the modified random forest algorithm model. As shown in Figure 11, the leaf node number is set to 5, while the number of decision trees is 110 and optimum RMSE is 1.7319. Model fitting error is much higher than when the forced cleaning strategy is used.
As shown in Figure 12, from the distribution diagram of the model calculated value and the observed value of test samples, a certain linear correlation between the model calculated value and the observed value is shown. The correlation coefficient is only 0.4166, representing the relatively low accuracy of the model prediction. Good calculation accuracy is achieved in the high distribution density range of [1,4] and a lack of the minimum Energies 2023, 16, 2971 12 of 16 accuracy in the range of [0, 1] and [4,8]. The experimental results above demonstrate that the fitting accuracy in the low distribution density range is not improved when the cleaning rate of abnormal samples drops. On the contrary, the performance of the model is influenced due to the longer training time caused by the larger training sample size.

Analysis Based on Traditional Random Forest in Lower Cleaning Rate
Therefore, as many samples as possible need to be retained when the characteristic factors are calculated. Analysis of the distributions of the characteristic factors in Figure 10 shows that the line length of 17% of the low-voltage grids is zero, which is obviously not true. Meanwhile, the power factor of 33% of the low-voltage grids and the load shape factor of 14.75% of the low-voltage grids are also zero, which contradicts the practical operating rules of low-voltage grids. In addition, the three-phase unbalance degree of 4.5% of the low-voltage grids is higher than 200, which does not comply with the calculation principle of the three-phase unbalance degree in low-voltage grids. Although missing and abnormal data exist throughout the sample set, much useful information is lost if a forced cleaning strategy is applied, and model accuracy suffers. By removing only the abnormal samples that do not comply with the characteristic calculation principles and preserving the samples with partially available properties, 89,067 data samples are retained, which constitute 53.56% of the total samples. The cleaned sample data are fed into the traditional random forest algorithm model. The leaf node number and the number of decision trees are selected as shown in Figure 11.
As shown in Figure 12, the distribution of the model calculated values against the observed values of the test samples exhibits a certain degree of linear correlation. However, the correlation coefficient is only 0.4166, which indicates relatively low prediction accuracy. Good calculation accuracy is achieved in the high distribution density range of [1, 4], whereas accuracy is poor in the ranges [0, 1] and [4, 8]. These experimental results demonstrate that the fitting accuracy in the low distribution density ranges is not improved when the cleaning rate of abnormal samples drops. On the contrary, model performance suffers from the longer training time caused by the larger training sample size.
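The difference between forced cleaning and the lower-cleaning-rate rule used in this subsection can be illustrated with a minimal sketch. It is not the authors' implementation: the feature names, the four toy samples, and the choice to treat zero values as missing while treating an unbalance degree above 200 as abnormal are all assumptions made for illustration.

```python
# Hypothetical feature names; the paper does not give code-level identifiers.
ZERO_MEANS_MISSING = ("line_length", "power_factor", "load_shape_factor")
UNBALANCE_KEY = "three_phase_unbalance"
UNBALANCE_MAX = 200.0  # threshold quoted in the text


def is_abnormal(sample):
    """Assumption: an unbalance degree above the threshold violates its definition."""
    return sample[UNBALANCE_KEY] > UNBALANCE_MAX


def mark_missing(sample):
    """Return a copy in which implausible zeros are replaced by None (missing)."""
    return {
        key: (None if key in ZERO_MEANS_MISSING and value == 0 else value)
        for key, value in sample.items()
    }


def forced_cleaning(samples):
    """Drop every sample that has any abnormal or missing feature."""
    marked = [mark_missing(s) for s in samples if not is_abnormal(s)]
    return [s for s in marked if all(v is not None for v in s.values())]


def partial_cleaning(samples):
    """Drop only abnormal samples; keep incomplete ones with missing values marked."""
    return [mark_missing(s) for s in samples if not is_abnormal(s)]


# Toy data for illustration only; these are not values from the paper.
grids = [
    {"line_length": 350.0, "power_factor": 0.95, "load_shape_factor": 1.10, "three_phase_unbalance": 12.0},
    {"line_length": 0.0, "power_factor": 0.92, "load_shape_factor": 1.05, "three_phase_unbalance": 30.0},
    {"line_length": 420.0, "power_factor": 0.0, "load_shape_factor": 1.20, "three_phase_unbalance": 18.0},
    {"line_length": 0.0, "power_factor": 0.90, "load_shape_factor": 1.15, "three_phase_unbalance": 250.0},
]

print(len(forced_cleaning(grids)), len(partial_cleaning(grids)))  # prints: 1 3
```

In this toy case forced cleaning keeps one sample out of four, while the partial rule keeps three, two of which carry a missing feature that the downstream model must then be able to handle.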

Analysis Based on Improved Random Forest in Lower Cleaning Rate
The same cleaned sample data are then fed into the improved random forest algorithm model. As shown in Figure 13, the leaf node number is set to 5 and the number of decision trees is 75, giving an optimum RMSE of 1.2639.
Figure 14 shows the distribution of the calculated values and the observed values of the test samples.

There is a clear linear correlation between the model calculated value and the observed value, with a correlation coefficient of 0.6733. Strong linear correlation appears in both the high distribution density range of [1, 5] and the low distribution density range of [5, 8]. This verifies that the model prediction performance can meet the accuracy requirement of the theoretical line loss model for low-voltage grids.
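The two indicators quoted in this section, RMSE and the correlation coefficient between calculated and observed values, can be computed as in the following minimal sketch. The function names, the interval edges, and the toy values are assumptions for illustration; the sketch assumes that the reported correlation coefficient is the Pearson coefficient, which the text does not state explicitly.

```python
import math


def rmse(observed, calculated):
    """Root-mean-square error between observed and calculated values."""
    n = len(observed)
    return math.sqrt(sum((o - c) ** 2 for o, c in zip(observed, calculated)) / n)


def pearson(observed, calculated):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_c = sum(calculated) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(observed, calculated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_c = sum((c - mean_c) ** 2 for c in calculated)
    return cov / math.sqrt(var_o * var_c)


def rmse_by_interval(observed, calculated, edges):
    """RMSE restricted to samples whose observed value lies in [lo, hi)."""
    result = {}
    for lo, hi in zip(edges[:-1], edges[1:]):
        pairs = [(o, c) for o, c in zip(observed, calculated) if lo <= o < hi]
        if pairs:  # skip intervals that contain no test samples
            obs, calc = zip(*pairs)
            result[(lo, hi)] = rmse(obs, calc)
    return result


# Toy line loss rates in percent; these are not values from the paper.
observed = [0.5, 1.5, 2.5, 3.5, 6.0]
calculated = [1.5, 1.5, 2.5, 3.5, 5.0]

print(rmse(observed, calculated))
print(pearson(observed, calculated))
print(rmse_by_interval(observed, calculated, edges=[0, 1, 5, 8]))
```

Reporting the error per line loss interval, rather than only one global value, makes it visible whether a model fits the sparsely populated ranges as well as the densely populated ones.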

Discussion
As demonstrated by the results, the improved random forest method, by optimizing the decision tree, helps improve the accuracy of the theoretical line loss calculation when a large number of samples have missing data. The results of the abovementioned methods are compared in Table 4 and can be summarized as follows:
(1) According to the definitions and calculation principles of the electrical characteristics of low-voltage grids, a reasonable range is established for each characteristic, and abnormal sample data outside this range are cleaned. With the sample data cleaning rule changed, the training effects of the random forest algorithm under cleaning rates of 95.57% and 46.44% are compared. The accuracy errors of the model are 1.3239 and 1.7319, respectively.
(2) The modified random forest algorithm solves the problem of missing characteristic factors. When the model is trained with the modified random forest algorithm at a sample data cleaning rate of 46.44%, the accuracy error is only 1.2161, the lowest among the approaches compared.
(3) The correlation between the model calculated value and the observed value reached 0.6711 when the improved random forest algorithm was used at the lower sample cleaning rate, which was much higher than in the other two situations (0.4522 and 0.4366). This also shows the good calculation accuracy of the improved random forest algorithm across different line loss intervals.
(4) More samples can be preserved when the modified random forest algorithm is used to handle samples with missing characteristics than when forced cleaning is used. Therefore, better accuracy can be obtained during model training and calculation, which demonstrates that the method proposed in this paper is more effective for calculating and analyzing the theoretical line loss of low-voltage grids.
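To make the idea in points (2) and (4) concrete, the following sketch shows one well-known way for a regression tree split to tolerate missing feature values. It uses the fractional-weight scheme familiar from C4.5-style trees and is an assumption chosen for illustration, not the exact modification proposed in this paper. The function names and the `(feature, target, weight)` tuple layout are hypothetical.

```python
def weighted_variance(samples):
    """Weighted variance of targets; samples are (feature, target, weight) tuples."""
    total = sum(w for _, _, w in samples)
    if total == 0:
        return 0.0
    mean = sum(w * y for _, y, w in samples) / total
    return sum(w * (y - mean) ** 2 for _, y, w in samples) / total


def split_with_missing(samples, threshold):
    """Split on feature <= threshold while tolerating feature values of None.

    Returns (gain, left, right). The gain is the variance reduction measured on
    samples with a known feature, scaled by their share of the total weight.
    """
    known = [s for s in samples if s[0] is not None]
    missing = [s for s in samples if s[0] is None]
    left = [s for s in known if s[0] <= threshold]
    right = [s for s in known if s[0] > threshold]
    if not left or not right:
        return 0.0, list(samples), []  # degenerate split: nothing is separated

    w_known = sum(w for _, _, w in known)
    w_total = w_known + sum(w for _, _, w in missing)
    p_left = sum(w for _, _, w in left) / w_known

    reduction = weighted_variance(known) - (
        p_left * weighted_variance(left) + (1 - p_left) * weighted_variance(right)
    )
    gain = (w_known / w_total) * reduction

    # An incomplete sample is sent to both children with a fractional weight,
    # so its target still contributes to training instead of being discarded.
    left = left + [(x, y, w * p_left) for x, y, w in missing]
    right = right + [(x, y, w * (1 - p_left)) for x, y, w in missing]
    return gain, left, right


# Toy samples (feature, line loss rate, weight); not values from the paper.
samples = [(1.0, 1.0, 1.0), (2.0, 1.0, 1.0), (8.0, 5.0, 1.0), (9.0, 5.0, 1.0), (None, 3.0, 1.0)]
gain, left, right = split_with_missing(samples, threshold=5.0)
print(gain, sum(w for _, _, w in left), sum(w for _, _, w in right))
```

Under this scheme the total sample weight is conserved across the two children, and a feature with many missing values is penalized through the scaled gain rather than by dropping the affected samples.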

Conclusions
In this paper, an improved random forest method was proposed for the calculation of theoretical line loss in low-voltage grids. The main work of this paper included the following:
(1) Reasonable electrical characteristic factors of low-voltage grids were constructed according to the physical and operational mechanisms that influence theoretical line loss. The concept of power supply torque was proposed for the first time to analyze the influence mechanism more accurately by coupling the physical factors and the operational factors.
(2) The random forest algorithm was improved by modifying the property classifying process of the decision tree and optimizing the weight factor allocation method when sample data are missing. The problem of the high requirement on characteristic data integrity was solved, and the accuracy of the model was improved when a large number of samples have missing data. When the sample data cleaning rate changes from 95.57% to 46.44%, the accuracy error of the traditional random forest increases from 1.3239 to 1.7319. However, the accuracy error of the improved random forest is only 1.2161 when the classifying process of the decision tree is modified.
We conclude that the proposed method can calculate the theoretical line loss of low-voltage grids more accurately when a large number of samples have missing data, a situation that arises because basic data are managed differently in different areas. Further work may aim at optimizing the parameters of the algorithm model in response to other data problems found in practical application, so as to improve the accuracy of the theoretical line loss calculation and guide actual loss reduction work more accurately.