Advanced Machine Learning Applications in Big Data Analytics


Introduction
We are currently living in the era of big data. Discovering valuable patterns from big data has become an active research topic, with considerable benefits for governments, businesses, and even individuals. Advanced machine learning models and algorithms have emerged as effective approaches to analyzing such data. At the same time, these methods and algorithms are promoting new applications in the field of big data.
Considering advanced machine learning and big data together, we have selected a series of relevant works in this special issue to showcase the latest research advancements in this field. Specifically, a total of thirty-three articles are included in this special issue, which can be roughly categorized into six groups: time series analysis, evolutionary computation, pattern recognition, computer vision, image encryption, and others.

Time Series Analysis
Li et al. [1] proposed an integrated model combining bagging and stacking for short-term traffic-flow prediction. The model incorporates vacation and peak time features, as well as occupancy and speed information. A stacking model with ridge regression as the meta-learner was established and optimized using the bagging model to obtain the Ba-Stacking model. The base learners' information structure was modified by weighting the error coefficients to improve utilization, resulting in a DW-Ba-Stacking model. Experimental results showed that the DW-Ba-Stacking model had the highest prediction accuracy for short-term traffic flow compared with traditional models.
Li et al. [2] proposed a nonlinear integrated forecasting model combining autoregressive and moving average (ARMA), grey system theory model (GM), and backpropagation (BP) model optimized by genetic algorithms (GA) to improve the forecasting accuracy of China coastal bulk coal freight index (CBCFI). The predicted values of ARMA and GM were used as input training samples for the neural network. A genetic algorithm was used to optimize the BP network to better exploit the prediction accuracy of the combined model. The combined ARMA-GM-GABP model was shown to have improved prediction accuracy and can effectively solve the CBCFI forecasting problem.
Wang et al. [3] proposed a new time series classification (TSC) method called CEEMD-MultiRocket. It combined complementary ensemble empirical mode decomposition (CEEMD) with an improved MultiRocket algorithm to increase classification accuracy. The raw time series was first decomposed into three sub-series using CEEMD. The improved MultiRocket was then applied to the raw time series, the selected decomposed sub-series, and the first-order difference of the raw time series to generate the final classification results. Experimental results showed that CEEMD-MultiRocket ranked second in classification accuracy on the 109 datasets from the UCR repository against a range of state-of-the-art TSC models, behind only HIVE-COTE 2.0, but with only 1.4% of the latter's computing load.
Bousbaa et al. [4] proposed an incremental and adaptive strategy using the online stochastic gradient descent algorithm (SGD) and particle swarm optimization metaheuristic (PSO). Two techniques were involved in data stream mining (DSM): adaptive sliding windows and change detection. The study focused on forecasting the value of the Euro in relation to the US dollar. Results showed that the flexible sliding window proved its ability to forecast the price direction with better accuracy compared to using a fixed sliding window.
Han et al. [5] proposed a model named LST-GCN to improve the accuracy of traffic flow predictions. They simulated spatiotemporal correlations by optimizing GCN parameters using an LSTM network. This method improved the traditional method of combining recurrent neural networks and graph neural networks in spatiotemporal traffic flow prediction. Experiments conducted on the PEMS dataset showed that their proposed method was more effective and outperformed other state-of-the-art methods.

Evolutionary Computation
Gao et al. [6] introduced an enhanced slime mould algorithm (MSMA) with a multi-population strategy. Based on the modified algorithm and the support vector machine (SVM), they proposed a prediction model called MSMA-SVM to provide a reference for postgraduate employment decisions and policy formulation. The multi-population strategy improved the solution accuracy of the algorithm, and the proposed model enhanced the ability to optimize the SVM. Experiments showed that the modified slime mould algorithm performed better than other algorithms, and that the optimized SVM model had better classification ability and more stable performance for predicting employment stability.
Bao et al. [7] introduced two strategies to address the shortcomings of the butterfly optimization algorithm (BOA): the random replacement strategy and the crisscross search strategy. These strategies were combined to create the random replacement crisscross BOA (RCCBOA). To evaluate the performance of RCCBOA, the authors conducted comparative experiments against nine other advanced algorithms on the IEEE CEC2014 benchmark function set, and found that RCCBOA was also effective when combined with a support vector machine (SVM) and feature selection (FS).
Zhang et al. [8] proposed an improved matrix particle swarm optimization algorithm (IMPSO) to optimize DNA sequence design. The algorithm incorporated centroid opposition-based learning and a dynamic update based on signal-to-noise ratio distance to search for high-quality solutions. The results showed that the proposed method achieved satisfactory outcomes and higher computational efficiency.
Song et al. [9] developed a multi-strategy adaptive particle swarm optimization (APSO/DU) to accelerate the solving speed of the mean-semivariance (MSV) model. A constraint factor was introduced to control velocity weight and reduce blindness in the search process. A dual-update (DU) strategy was designed based on new speed and position update strategies. The experimental results showed that the APSO/DU algorithm had better convergence accuracy and speed.
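As a rough illustration of how a factor applied to the velocity update can damp a particle swarm, the sketch below implements a plain PSO with the standard Clerc-Kennedy constriction coefficient. It is not the APSO/DU algorithm of [9]: the dual-update strategy and the MSV objective are omitted, and the function names, the sphere objective, and all parameter values are assumptions chosen for the example.

```python
import random


def sphere(x):
    """Toy objective (an assumption for this sketch): minimized at the origin."""
    return sum(v * v for v in x)


def pso(objective, dim=2, particles=20, iters=200, seed=0):
    """Minimal PSO with a constriction factor on the velocity update."""
    rng = random.Random(seed)
    # Standard Clerc-Kennedy constriction settings, not values taken from [9].
    chi, c1, c2 = 0.7298, 2.05, 2.05
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(particles), key=lambda i: best_val[i])
    gbest_pos, gbest_val = best_pos[g][:], best_val[g]

    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # chi scales the whole update, which bounds the step size.
                vel[i][d] = chi * (
                    vel[i][d]
                    + c1 * r1 * (best_pos[i][d] - pos[i][d])
                    + c2 * r2 * (gbest_pos[d] - pos[i][d])
                )
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < gbest_val:
                    gbest_val, gbest_pos = val, pos[i][:]
    return gbest_pos, gbest_val
```

Calling `pso(sphere)` returns the best position found and its objective value; replacing `sphere` with another function of a list of floats is enough to try a different problem.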
Li et al. [10] designed an intelligent prediction model for talent stability in higher education using a kernel extreme learning machine (KELM) and proposed a differential evolution crisscross whale optimization algorithm (DECCWOA) for optimizing the model parameters. The DECCWOA was shown to achieve high accuracy and fast convergence in solving both unimodal and multimodal functions. The DECCWOA was combined with KELM and feature selection (DECCWOA-KELM-FS) to achieve efficient intelligent prediction of talent stability for universities or colleges in Wenzhou. The results showed that the proposed model outperformed the other compared algorithms. The resulting system can serve as a reliable way to predict higher education talent flows.
Wang et al. [11] proposed a new algorithm called SEGDE to solve the capacitated vehicle routing problem (CVRP). It combined the saving mileage algorithm (SMA), sequential encoding (SE), and gravitational search algorithm (GSA) to address the problems of the differential evolution (DE) algorithm. The SMA was used to initialize the population of the DE. The SE approach was used to adjust the differential mutation strategy. The GSA was applied to adjust the evolutionary search direction and improve search efficiency. Four CVRPs were tested with SEGDE and the results showed that SEGDE effectively solved CVRPs with better performance.
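The population-initialization idea can be illustrated with a small savings heuristic. The sketch below assumes that the saving mileage algorithm refers to the classical Clarke-Wright savings method, in which merging two routes at customers i and j saves d(0,i) + d(0,j) - d(i,j), where 0 is the depot. It is not the SEGDE algorithm of [11]; the function name, the Euclidean distances, and the data layout are assumptions made for the example.

```python
import math


def savings_routes(depot, customers, demand, capacity):
    """Clarke-Wright savings heuristic (illustrative sketch).

    depot: (x, y); customers: {id: (x, y)}; demand: {id: load}.
    Returns a list of routes, each an ordered list of customer ids.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # Start with one out-and-back route per customer.
    routes = {c: [c] for c in customers}
    route_of = {c: c for c in customers}

    # Saving obtained by serving i and j consecutively instead of separately.
    savings = sorted(
        ((dist(depot, customers[i]) + dist(depot, customers[j])
          - dist(customers[i], customers[j]), i, j)
         for i in customers for j in customers if i < j),
        reverse=True,
    )

    for s, i, j in savings:
        if s <= 0:
            break
        ri, rj = route_of[i], route_of[j]
        if ri == rj:
            continue
        a, b = routes[ri], routes[rj]
        if sum(demand[c] for c in a + b) > capacity:
            continue
        # Orient the routes so that i ends route a and j starts route b.
        if a[0] == i:
            a = a[::-1]
        if b[-1] == j:
            b = b[::-1]
        if a[-1] != i or b[0] != j:
            continue  # i or j is interior to its route, so no merge
        routes[ri] = a + b
        del routes[rj]
        for c in b:
            route_of[c] = ri
    return list(routes.values())
```

In a hybrid scheme such as the one described above, routes built this way would only seed the initial population; the evolutionary search is a separate step that this sketch does not cover.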

Pattern Recognition
Miu et al. [12] proposed a two-step method to more finely classify the event type of stock announcement news. First, candidate event trigger words and co-occurrence words were extracted and arranged in order of common expressions. Then, final event types were determined using three proposed criteria. Based on the real data of the Chinese stock market, this method constructed 54 event types (p = 0.927, f = 0.946), and included some types not discussed in previous studies.
Jia et al. [13] proposed a new hybrid graph network recommendation model called the user multi-behavior graph network (UMBGN) to make full use of multi-behavior user-interaction information. This model used a joint learning mechanism to integrate user-item multi-behavior interaction sequences. A user multi-behavior information-aware layer was designed to focus on the long-term multi-behavior features of users and to learn temporally ordered user-item interaction information through BiGRU and AUGRU units. Experiments on three public datasets showed that this model outperformed the best baselines.
Fatehi et al. [14] investigated the effectiveness of adversarial attacks on clinical document classification and proposed a defense mechanism to develop a robust convolutional neural network (CNN) model and counteract these attacks. Various black-box attacks based on concatenation and editing adversaries were applied to unstructured clinical text. A defense technique based on feature selection and filtering was proposed to improve the robustness of the models. Experimental results showed that small perturbations caused a significant drop in performance, and that the proposed defense mechanism avoided this drop and enhanced the robustness of the CNN model for clinical document classification.
Yin et al. [15] proposed an improved hierarchical clustering algorithm called PRI-MFC to solve the problems of traditional hierarchical clustering algorithms. The algorithm was tested on artificial and real datasets and the experimental results showed superiority in clustering effect, quality, and time consumption.
Yang et al. [16] proposed an intelligent fault diagnosis method for bearings based on variational mode decomposition (VMD), composite multi-scale dispersion entropy (CMDE), and a deep belief network (DBN) with the particle swarm optimization (PSO) algorithm. The number of modal components decomposed by VMD was determined by observing the center frequency, and the signal was reconstructed according to the kurtosis. The CMDE of the reconstructed signal was calculated to form training and test samples for pattern recognition. PSO was used to optimize the parameters of the DBN model for fault identification. Comparative experiments showed that the VMD-CMDE-PSO-DBN method had application value in intelligent fault diagnosis.
Chen et al. [17] proposed an improved least squares support vector machine (LS-SVM) method to solve the problem of abnormal or lost quick access recorder (QAR) data. This method used the entropy weight method to obtain index weights, principal component analysis for dimensionality reduction, and LS-SVM for data fitting and repair. The method was tested using QAR data from multiple real plateau flights and showed high accuracy and goodness of fit. This indicated that the improved LS-SVM model could effectively fit and supplement missing QAR data in the plateau area using historical flight data.
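The entropy weight step mentioned above can be sketched in a few lines. The version below uses the textbook formulation: each indicator column is normalized to proportions, its entropy is scaled by ln(n), and weights are proportional to one minus that entropy. It assumes non-negative, already comparable indicator values and at least one non-constant column. The function name and input layout are illustrative, and the exact variant used in [17] may differ.

```python
import math


def entropy_weights(rows):
    """Textbook entropy weight method (illustrative sketch).

    rows: list of samples, each a list of non-negative indicator values.
    Returns one weight per indicator; the weights sum to 1.
    """
    n, m = len(rows), len(rows[0])
    divergence = []
    for j in range(m):
        col = [r[j] for r in rows]
        total = sum(col)
        p = [v / total for v in col]
        # Entropy scaled to [0, 1]; terms with p = 0 contribute nothing.
        e = -sum(q * math.log(q) for q in p if q > 0) / math.log(n)
        # A column that varies more across samples has lower entropy.
        divergence.append(1.0 - e)
    s = sum(divergence)  # assumed non-zero (not every column is constant)
    return [d / s for d in divergence]
```

An indicator that takes the same value in every sample receives a weight of approximately zero, since it carries no information for distinguishing the samples.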
Yu et al. [18] proposed a novel hierarchical heterogeneous graph attention network to model global semantic relations among nodes for emotion-cause pair extraction (ECPE). This method introduced all types of semantic elements involved in ECPE. A pair-level subgraph was constructed to explore the correlation between pair nodes and their different neighboring nodes. Two-level heterogeneous graph attention networks were used to achieve representation learning of clauses and clause pairs. Experiments on benchmark datasets showed that this proposed model achieved significant improvement over 13 compared methods.

Computer Vision
Fan et al. [19] proposed an infrared vehicle target detection algorithm based on an improved version of YOLOv5. The algorithm used the DenseBlock module to increase shallow feature extraction ability, and the Ghost convolution layer replaced the ordinary convolution layer to improve network feature extraction ability. The detection accuracy of the whole network was enhanced by adding a channel attention mechanism and modifying the loss function. Experimental results showed that the addition of DenseBlock and EIOU modules alone improved detection accuracy by 2.5% and 3%, respectively, compared to the original YOLOv5 algorithm. The combination of DenseBlock and Ghost convolution had the best effect, and when adding three modules at the same time, the mAP fluctuation was smaller, reaching 73.1%, which was 4.6% higher than the original YOLOv5 algorithm.
Guerrero-Ibañez et al. [20] proposed a model based on convolutional neural networks to identify and classify tomato leaf diseases using a public dataset and photographs taken in the fields to improve crop yields. Generative adversarial networks were used to avoid overfitting. The proposed model achieved an accuracy greater than 99% in detecting and classifying diseases in tomato leaves.
Zhang et al. [21] proposed a Hemerocallis citrina Baroni maturity detection method based on a deep learning algorithm, called the GGSC YOLOv5 algorithm. This method integrated a lightweight neural network and a dual attention mechanism. The improved GGSC YOLOv5 algorithm reduced the number of parameters and FLOPs by 63.58% and 68.95%, respectively, and reduced the number of network layers by about 33.12%. The detection precision reached 84.9%, an improvement of about 2.55%, and the real-time detection speed increased from 64.16 FPS to 96.96 FPS.
Chen et al. [22] proposed a method for detecting abnormal pilot behavior during flight based on an improved YOLOv4 deep learning algorithm and an attention mechanism. The CBAM attention mechanism was introduced to improve the feature extraction capability of the deep neural network. The recognition rate of the improved YOLOv4 was significantly higher than that of the original algorithm. The experimental results showed that the improved YOLOv4 had a high mAP, accuracy, and recall rate.
Jin et al. [23] proposed a quantum dynamic optimization algorithm called quantum dynamic neural architecture search (QDNAS) to find the optimal structure for a candidate network. The proposed QDNAS viewed the iterative evolution of the optimization over time as a quantum dynamic process. Experiments on four benchmarks showed that QDNAS was consistently better than all baseline methods in image classification tasks.
Yue et al. [24] designed a detection algorithm called TP-ODA for border patrol object detection. This algorithm improved the detection frame imbalance problem and optimized the feature fusion module of the algorithm with the PDOEM structure. The TP-ODA algorithm was tested on the Border Patrol object dataset BDP and showed improvement in mAP, GFLOPs, model volume, and FPS compared to the baseline model.
Ye et al. [25] proposed an innovative classification method for hyperspectral remote sensing images (HRSIs) called IPCEHRIC, which utilized the advantages of an enhanced PSO algorithm, a convolutional neural network (CNN), and an extreme learning machine (ELM). Experiments were conducted on the Pavia University data and on actual HRSIs acquired after the Jiuzhaigou 7.0 earthquake, and the results showed that IPCEHRIC could accurately classify these data with stronger generalization, faster learning ability, and higher classification accuracy.

Image Encryption
Huang et al. [26] proposed a polymorphic mapping-coupled map lattice with information entropy for encrypting color images, improving the traditional one-dimensional mapping coupled lattice. The original 4 × 4 matrix was extended and a new pixel-level substitution method was proposed using the Huffman idea. The idea of polymorphism was employed and the pseudo-random sequence was diversified and homogenized. Experiments were conducted on three plaintext color images, "Lena", "Peppers" and "Mandrill", and the results showed that the algorithm had a large key space, better sensitivity to keys and plaintext images, and a better encryption effect.
Chen et al. [27] proposed a new digital image encryption algorithm based on the splicing model and 1D secondary chaotic system. The algorithm divided the plain image into four sub-parts using quaternary coding, which could be coded separately. The key space was big enough to resist exhaustive attacks due to the use of a 1D quadratic chaotic system. Experimental results showed that the algorithm had high security and a good encryption effect.
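To make the general idea of chaos-based image encryption concrete, the sketch below derives a byte keystream from the logistic map and XORs it with pixel values. The logistic map is used here only as a familiar stand-in: it is not the chaotic system or the splicing model of [27], nor the coupled map lattice of [26], and the scheme is not secure. The function names, the initial value `x0`, and the parameter `r` are assumptions for the example, with `x0` and `r` playing the role of the key.

```python
def logistic_keystream(x0, r, n, burn_in=100):
    """Generate n pseudo-random bytes from the logistic map x -> r*x*(1-x)."""
    x = x0
    # Discard early iterates so the output depends less visibly on x0.
    for _ in range(burn_in):
        x = r * x * (1 - x)
    stream = []
    for _ in range(n):
        x = r * x * (1 - x)
        stream.append(int(x * 256) % 256)
    return stream


def xor_cipher(pixels, x0=0.3141592, r=3.99):
    """Encrypt or decrypt a flat list of 8-bit pixel values.

    XOR is its own inverse, so applying the function twice with the same
    (x0, r) restores the original pixels.
    """
    keystream = logistic_keystream(x0, r, len(pixels))
    return [p ^ k for p, k in zip(pixels, keystream)]
```

Because the map is sensitive to its initial condition, a slightly different `x0` produces a different keystream; this is the property that key-sensitivity tests in such papers are designed to measure.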

Others
Muntean et al. [28] proposed a methodological framework based on design science research for designing and developing data and information artifacts in data analysis projects. They applied several classification algorithms to previously labeled datasets through clustering and introduced a set of metrics to evaluate the performance of classifiers. Their proposed framework can be used for any data analysis problem that involves machine learning techniques.
Zheng et al. [29] proposed a novel KNN-based consensus algorithm that classified transactions based on their priority. The KNN algorithm calculated the distance between transactions based on factors that impacted their priority. Experimental results obtained by adopting the enhanced consensus algorithm showed that the service level agreement (SLA) was better satisfied in the BaaS systems.
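The classification step can be illustrated with a plain k-nearest-neighbors vote. In the sketch below, each transaction is a tuple of numeric factors and the label is its priority class; the Euclidean distance, the majority vote, and the function name are assumptions for the example, and the factors and distance measure actually used in [29] may differ.

```python
import math
from collections import Counter


def knn_priority(train, query, k=3):
    """Predict a priority label by majority vote among the k nearest neighbors.

    train: list of (feature_tuple, label) pairs.
    query: feature tuple for the transaction to classify.
    """
    # Sort labeled transactions by Euclidean distance to the query.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With features scaled to a common range, a transaction whose factors lie close to previously high-priority transactions is assigned the high-priority class, and a consensus layer could then order transactions by that class.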
Liu et al. [30] proposed a coordinated output strategy for peak shaving and frequency regulation using existing energy storage to improve its economic development and benefits in industrial parks. The strategy included profit and cost models, an economic optimization model for dividing peak shaving and frequency regulation capacity, and an intra-day model predictive control method for rolling optimization. The experimental results showed a 10.96% reduction in daily electricity costs using this strategy.
Hussain et al. [31] presented a COVID-19 warning system based on a machine learning time series model using confirmed, detected, recovered, and death case data. The authors compared the performance of long short-term memory (LSTM), auto-regressive (AR), PROPHET, and autoregressive integrated moving average (ARIMA) models for predicting confirmed cases, and found that the PROPHET and AR models had low error rates in predicting positive cases.
Xie et al. [32] presented an effective solution for the problem of confidentiality management of digital archives on the cloud. The basic concept involved setting up a local server between the cloud and each client of an archive system to run a confidentiality management model of digital archives on the cloud. This model included an archive release model and an archive search model. The archive release model encrypted archive files and generated feature data for the archive data. The archive search model transformed query operations on the archive data submitted by a searcher. Both theoretical analysis and experimental evaluation demonstrated the good performance of the proposed solution.
Providence et al. [33] discussed the influence of temporal and spatial normalization modules on multivariate time series forecasts. The study encompassed various neural networks and their applications. Extensive experimental work on three datasets showed that adding more normalization components could greatly improve the effectiveness of canonical frameworks.

Future Directions
We believe that advanced machine learning and big data will continue to develop. On the one hand, advanced machine learning algorithms will discover more valuable patterns from big data, thereby fueling the emergence of new applications for big data. On the other hand, the constantly increasing volume of big data places higher demands on advanced machine learning, leading to the development of more effective and efficient machine learning algorithms. Therefore, developing new machine learning algorithms for big data analysis and expanding the application scenarios of big data are important directions for future research.