Search Results (15)

Search Parameters:
Keywords = MLlib

17 pages, 1292 KiB  
Article
A Hybrid Federated Learning Framework for Privacy-Preserving Near-Real-Time Intrusion Detection in IoT Environments
by Glauco Rampone, Taras Ivaniv and Salvatore Rampone
Electronics 2025, 14(7), 1430; https://doi.org/10.3390/electronics14071430 - 2 Apr 2025
Cited by 1 | Viewed by 2130
Abstract
The proliferation of Internet of Things (IoT) devices has introduced significant challenges in cybersecurity, particularly in the realm of intrusion detection. While effective, traditional centralized machine learning approaches often compromise data privacy and scalability due to the need for data aggregation. In this study, we propose a federated learning framework for near-real-time intrusion detection in IoT environments. Federated learning enables decentralized model training across multiple devices without exchanging raw data, thereby preserving privacy and reducing communication overhead. Our approach builds upon a previously proposed hybrid model, which combines a machine learning model deployed on IoT devices with a second-level cloud-based analysis. This previous work required all data to be passed to the cloud in aggregate form, limiting security. We extend this model to incorporate federated learning, allowing for distributed training while maintaining high accuracy and privacy. We evaluate the performance of our federated-learning-based model against a traditional centralized model, focusing on accuracy retention, training efficiency, and privacy preservation. Our experiments utilize actual attack data partitioned across multiple nodes. The results demonstrate that this hybrid federated learning not only offers significant advantages in terms of data privacy and scalability but also retains the competitive accuracy of the previous approach. This paper also explores the integration of federated learning with cloud-based infrastructure, leveraging platforms such as Databricks and Google Cloud Storage. We discuss the challenges and benefits of implementing federated learning in a distributed environment, including the use of Apache Spark and MLlib for scalable model training. The results show that all the algorithms used maintain an excellent identification accuracy (98% for logistic regression, 97% for SVM, and 100% for Random Forest). We also report a very short training time (less than 11 s on a single machine). The previously reported very low application time is also confirmed (0.16 s for over 1,697,851 packets). Our findings highlight the potential of federated learning as a viable solution for enhancing cybersecurity in IoT ecosystems, paving the way for further research in privacy-preserving machine learning techniques. Full article
(This article belongs to the Special Issue Network Security and Cryptography Applications)
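
As a rough illustration of the per-node training step such a framework relies on, the following minimal PySpark MLlib sketch (an editor's addition, not the authors' code) trains a Random Forest classifier on a locally held partition of labeled traffic features. The data path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("local-node-ids-training").getOrCreate()

# Hypothetical: each federated node trains only on its own locally held traffic records.
local_df = spark.read.parquet("/data/node_a/traffic_features.parquet")

assembler = VectorAssembler(
    inputCols=[c for c in local_df.columns if c != "label"],  # assumes numeric feature columns
    outputCol="features",
)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=50)

model = rf.fit(assembler.transform(local_df))
# In a federated setting, only model parameters (not raw packets) would then be
# shared with the cloud-side aggregation step.
model.write().overwrite().save("/models/node_a/rf")
```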

27 pages, 1293 KiB  
Article
Optimizing Apache Spark MLlib: Predictive Performance of Large-Scale Models for Big Data Analytics
by Leonidas Theodorakopoulos, Aristeidis Karras and George A. Krimpas
Algorithms 2025, 18(2), 74; https://doi.org/10.3390/a18020074 - 1 Feb 2025
Cited by 17 | Viewed by 1658
Abstract
In this study, we analyze the performance of the machine learning operators in Apache Spark MLlib for K-Means, Random Forest Regression, and Word2Vec. We used a multi-node Spark cluster and collected detailed execution metrics across diverse datasets and parameter settings. The data were used to train predictive models that had up to 98% accuracy in forecasting performance. By building actionable predictive models, our research provides a unique treatment of key hyperparameter tuning, scalability, and real-time resource allocation challenges. Specifically, the practical value of traditional models in optimizing Apache Spark MLlib workflows was shown, achieving up to 30% resource savings and a 25% reduction in processing time. These models enable system optimization, reduce computational overhead, and boost the overall performance of big data applications. Ultimately, this work not only closes significant gaps in predictive performance modeling, but also paves the way for real-time analytics over a distributed environment. Full article
(This article belongs to the Special Issue Algorithms in Data Classification (2nd Edition))
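
The kind of training data such performance models consume can be gathered with a loop like the following hedged sketch: it times MLlib K-Means fits over a small parameter grid and records (rows, k, runtime) tuples. The dataset path and parameter values are hypothetical.

```python
import time
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("mllib-metrics-collection").getOrCreate()

# Hypothetical feature table; in practice, datasets of varying size and dimensionality.
df = spark.read.parquet("/data/benchmark/features.parquet").cache()
num_rows = df.count()

records = []
for k in (4, 8, 16, 32):
    start = time.time()
    KMeans(k=k, featuresCol="features", seed=42).fit(df)
    records.append((num_rows, k, time.time() - start))

# (num_rows, k, runtime_s) tuples like these would feed the performance-prediction models.
metrics = spark.createDataFrame(records, ["num_rows", "k", "runtime_s"])
metrics.show()
```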

19 pages, 6498 KiB  
Article
Temporal Association Rule Mining: Race-Based Patterns of Treatment-Adverse Events in Breast Cancer Patients Using SEER–Medicare Dataset
by Nabil Adam and Robert Wieder
Biomedicines 2024, 12(6), 1213; https://doi.org/10.3390/biomedicines12061213 - 29 May 2024
Cited by 1 | Viewed by 1534
Abstract
PURPOSE: Disparities in the screening, treatment, and survival of African American (AA) patients with breast cancer extend to adverse events experienced with systemic therapy. However, data are limited and difficult to obtain. We addressed this challenge by applying temporal association rule (TAR) mining using the SEER–Medicare dataset for differences in the association of specific adverse events (AEs) and treatments (TRs) for breast cancer between AA and White women. We considered two categories of cancer care providers and settings: practitioners providing care in the outpatient units of hospitals and institutions and private practitioners providing care in their offices. PATIENTS AND METHODS: We considered women enrolled in the Medicare fee-for-service option at age 65 who qualified by age and not disability, who were diagnosed with breast cancer, with attributed patient factors of age, race, marital status, comorbidities, prior malignancies, and prior therapy, and disease factors of stage, grade, ER/PR and HER2 status, and laterality. We included 141 HCPCS drug J codes for chemotherapy, biotherapy, and hormone therapy drugs, which we consolidated into 46 mechanistic categories and generated AE data. We consolidated AEs from ICD9 codes into 18 categories associated with breast cancer therapy. We applied TAR mining to determine associations between the 46 TR and 18 AE categories in the context of the patient categories outlined. We applied the spark.mllib implementation of the FPGrowth algorithm, a parallel version called PFP. We considered differences of at least one unit of lift as significant between groups. The model's results demonstrated a high overlap between the identified TR–AE association set and the actual set. RESULTS: Our results demonstrate that specific TR/AE associations are highly dependent on race, stage, and venue of care administration. CONCLUSIONS: Our data demonstrate the usefulness of this approach in identifying differences in the associations between TRs and AEs in different populations and serve as a reference for predicting the likelihood of AEs in different patient populations treated for breast cancer. Our novel approach using unsupervised learning enables the discovery of association rules while paying special attention to temporal information, resulting in greater predictive and descriptive power as a patient's health and life status change over time. Full article
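
For orientation, association-rule mining with Spark's FPGrowth looks roughly like the sketch below. The abstract uses the RDD-based spark.mllib PFP implementation; this sketch instead uses the DataFrame-based pyspark.ml.fpm.FPGrowth (Spark 2.4+), which reports confidence and lift directly. The transaction contents and thresholds are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("tr-ae-association-mining").getOrCreate()

# Hypothetical transactions: each row lists the treatment (TR) and adverse-event (AE)
# category codes observed for one patient within a time window.
transactions = spark.createDataFrame(
    [
        (0, ["TR_anthracycline", "AE_neutropenia"]),
        (1, ["TR_anthracycline", "TR_taxane", "AE_neuropathy"]),
        (2, ["TR_hormone", "AE_thromboembolism"]),
    ],
    ["id", "items"],
)

fp = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.1)
model = fp.fit(transactions)

# Rules come with confidence and lift, so patient groups can be compared on lift differences.
model.associationRules.orderBy("lift", ascending=False).show(truncate=False)
```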

34 pages, 10875 KiB  
Article
EverAnalyzer: A Self-Adjustable Big Data Management Platform Exploiting the Hadoop Ecosystem
by Panagiotis Karamolegkos, Argyro Mavrogiorgou, Athanasios Kiourtis and Dimosthenis Kyriazis
Information 2023, 14(2), 93; https://doi.org/10.3390/info14020093 - 3 Feb 2023
Cited by 6 | Viewed by 2709
Abstract
Big Data is a phenomenon that affects today’s world, with new data being generated every second. Today’s enterprises face major challenges from the increasingly diverse data, as well as from indexing, searching, and analyzing such enormous amounts of data. In this context, several frameworks and libraries for processing and analyzing Big Data exist. Among those frameworks, Hadoop MapReduce, Mahout, Spark, and MLlib appear to be the most popular, although it is unclear which of them is best suited to, and performs best in, various data processing and analysis scenarios. This paper proposes EverAnalyzer, a self-adjustable Big Data management platform built to fill this gap by exploiting all of these frameworks. The platform is able to collect data both in a streaming and in a batch manner, utilizing the metadata obtained from its users’ processing and analytical activities applied to the collected data. Based on this metadata, the platform recommends the optimum framework for the data processing/analytical activities that the users aim to execute. To verify the platform’s efficiency, numerous experiments were carried out using 30 diverse datasets related to various diseases. The results revealed that EverAnalyzer correctly suggested the optimum framework in 80% of the cases, indicating that the platform made the best selections in the majority of the experiments. Full article

18 pages, 2930 KiB  
Article
Apache Spark and MLlib-Based Intrusion Detection System or How the Big Data Technologies Can Secure the Data
by Otmane Azeroual and Anastasija Nikiforova
Information 2022, 13(2), 58; https://doi.org/10.3390/info13020058 - 24 Jan 2022
Cited by 37 | Viewed by 8129
Abstract
Since the turn of the millennium, the volume of data has increased significantly in both industries and scientific institutions. Processing the volume and variety of data we are dealing with is unlikely to be accomplished with conventional software solutions. Thus, new technologies belonging to the big data processing area, able to distribute and process data in a scalable way, are integrated into classical Business Intelligence (BI) systems or replace them. Furthermore, we can benefit from big data technologies to gain knowledge about security, which can be obtained from massive databases. The paper presents a security-relevant data analysis based on the big data analytics engine Apache Spark. A prototype intrusion detection system is developed, aimed at detecting data anomalies through machine learning by using the k-means clustering algorithm implemented in Spark’s MLlib. The extraction of features to detect anomalies is currently challenging because the problem of detecting anomalies is not actively and exhaustively monitored. The detection of abnormal data can be achieved by using relevant data that are already in companies’ and scientific organizations’ possession. Their interpretation and further processing in a continuous manner can sufficiently contribute to anomaly and intrusion detection. Full article
(This article belongs to the Special Issue Big Data, IoT and Cloud Computing)
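
A minimal sketch of the general k-means anomaly-flagging idea follows (the abstract uses the RDD-based spark.mllib API; this sketch uses the DataFrame API, and the data path, columns, and 99th-percentile cut-off are illustrative assumptions, not the paper's configuration): cluster the records, then flag those unusually far from their assigned centroid.

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-anomaly-sketch").getOrCreate()

# Hypothetical table of vectorized connection records (vector column "features").
df = spark.read.parquet("/data/security/connection_features.parquet")

model = KMeans(k=8, featuresCol="features", seed=1).fit(df)
centers = model.clusterCenters()  # list of numpy arrays

@F.udf(DoubleType())
def dist_to_center(features, cluster):
    # Euclidean distance of a record to its assigned cluster centre.
    return float(np.linalg.norm(features.toArray() - centers[cluster]))

scored = model.transform(df).withColumn(
    "distance", dist_to_center("features", "prediction")
)

# Records unusually far from every learned cluster are treated as anomalies;
# the 99th-percentile threshold is an arbitrary illustrative choice.
threshold = scored.approxQuantile("distance", [0.99], 0.01)[0]
anomalies = scored.filter(F.col("distance") > threshold)
anomalies.show()
```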

10 pages, 1844 KiB  
Article
Hydrogen Safety Prediction and Analysis of Hydrogen Refueling Station Leakage Accidents and Process Using Multi-Relevance Machine Learning
by Wujian Yang, Jianghao Dong and Yuke Ren
World Electr. Veh. J. 2021, 12(4), 185; https://doi.org/10.3390/wevj12040185 - 13 Oct 2021
Cited by 4 | Viewed by 3648
Abstract
Hydrogen energy vehicles are increasingly widely used. To ensure the safety of hydrogen refueling stations, research into the detection of hydrogen leaks is required. Offline analysis using machine learning is achieved with Spark SQL and Spark MLlib technology. In this study, to determine the safety status of a hydrogen refueling station, we used multiple algorithm models to perform the calculation and analysis: a multi-source data association prediction algorithm, a stochastic gradient descent algorithm, a deep neural network optimization algorithm, and other models. We successfully analyzed the data, including the potential relationships, internal relationships, and operating patterns within the data, to detect the safety statuses of hydrogen refueling stations. Full article

17 pages, 3611 KiB  
Article
A Recommendation Engine for Predicting Movie Ratings Using a Big Data Approach
by Mazhar Javed Awan, Rafia Asad Khan, Haitham Nobanee, Awais Yasin, Syed Muhammad Anwar, Usman Naseem and Vishwa Pratap Singh
Electronics 2021, 10(10), 1215; https://doi.org/10.3390/electronics10101215 - 20 May 2021
Cited by 77 | Viewed by 13834
Abstract
In this era of big data, the amount of video content has dramatically increased with an exponential broadening of video streaming services. Hence, it has become very strenuous for end-users to search for their desired videos. Therefore, to attain an accurate and robust clustering of information, a hybrid algorithm was used to introduce a recommender engine with collaborative filtering using Apache Spark and machine learning (ML) libraries. In this study, we implemented a movie recommendation system based on a collaborative filtering approach using the alternating least squares (ALS) model to predict the best-rated movies. Our proposed system uses a user’s most recent search data regarding movie category and references this to instruct the recommender engine, thereby making a list of predictions for top ratings. The proposed study used a model-based matrix factorization approach, the ALS algorithm, along with a collaborative filtering technique, which solved the cold-start, sparsity, and scalability problems. In particular, we performed experimental analysis and successfully obtained minimum root mean squared errors (RMSEs) of approximately 0.8959 to 0.97613. Moreover, our proposed movie recommendation system showed an accuracy of 97% and predicted the top 1000 ratings for movies. Full article
(This article belongs to the Special Issue Big Data Privacy-Preservation)
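
A minimal, hedged sketch of this kind of ALS-based collaborative filtering with Spark MLlib is shown below; the ratings file, column names, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("als-movie-recommender").getOrCreate()

# Hypothetical ratings table with userId, movieId, rating columns.
ratings = spark.read.csv("/data/ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.8, 0.2], seed=7)

als = ALS(
    userCol="userId", itemCol="movieId", ratingCol="rating",
    rank=10, maxIter=10, regParam=0.1,
    coldStartStrategy="drop",   # avoid NaN predictions for unseen users/items
)
model = als.fit(train)

rmse = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
).evaluate(model.transform(test))
print(f"RMSE = {rmse:.4f}")

# Top-10 movie recommendations per user.
model.recommendForAllUsers(10).show(truncate=False)
```

The coldStartStrategy="drop" setting is one common way to keep RMSE well defined when the test split contains users or items absent from training.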

12 pages, 825 KiB  
Article
JAMPI: Efficient Matrix Multiplication in Spark Using Barrier Execution Mode
by Tamas Foldi, Chris von Csefalvay and Nicolas A. Perez
Big Data Cogn. Comput. 2020, 4(4), 32; https://doi.org/10.3390/bdcc4040032 - 5 Nov 2020
Cited by 7 | Viewed by 5650
Abstract
The new barrier mode in Apache Spark allows for embedding distributed deep learning training as a Spark stage to simplify the distributed training workflow. In Spark, a task in a stage does not depend on any other tasks in the same stage, and hence it can be scheduled independently. However, several algorithms require more sophisticated inter-task communications, similar to the MPI paradigm. By combining distributed message passing (using asynchronous network I/O), OpenJDK’s new auto-vectorization, and Spark’s barrier execution mode, we can add non-map/reduce-based algorithms, such as Cannon’s distributed matrix multiplication, to Spark. We document an efficient distributed matrix multiplication using Cannon’s algorithm, which significantly improves on the performance of the existing MLlib implementation. Used within a barrier task, the algorithm described herein results in an up to 24% performance increase on a 10,000 × 10,000 square matrix with a significantly lower memory footprint. Applications of efficient matrix multiplication include, among others, accelerating the training and implementation of deep convolutional neural network-based workloads, and thus such efficient algorithms can play a ground-breaking role in the faster and more efficient execution of even the most complicated machine learning tasks. Full article
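
The barrier execution mode the paper builds on can be exercised directly from PySpark. The sketch below is not the JAMPI implementation; it only shows how tasks in a barrier stage are launched together and can synchronize and discover their peers, which is the prerequisite for MPI-style algorithms such as Cannon's.

```python
from pyspark.sql import SparkSession
from pyspark import BarrierTaskContext

spark = SparkSession.builder.appName("barrier-stage-demo").getOrCreate()
sc = spark.sparkContext

def barrier_stage(iterator):
    ctx = BarrierTaskContext.get()
    # All tasks of a barrier stage are scheduled at once; barrier() blocks until
    # every task has reached this point, enabling MPI-like coordination.
    ctx.barrier()
    peers = [info.address for info in ctx.getTaskInfos()]
    yield (ctx.partitionId(), peers)

# Partitions of a barrier RDD map one-to-one onto the cooperating tasks
# (e.g., the process grid of Cannon's algorithm). The cluster must have at
# least as many free slots as partitions, or the stage will not start.
result = sc.parallelize(range(4), 4).barrier().mapPartitions(barrier_stage).collect()
print(result)
```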

17 pages, 931 KiB  
Article
Toward Developing Efficient Conv-AE-Based Intrusion Detection System Using Heterogeneous Dataset
by Muhammad Ashfaq Khan and Juntae Kim
Electronics 2020, 9(11), 1771; https://doi.org/10.3390/electronics9111771 - 26 Oct 2020
Cited by 83 | Viewed by 5801
Abstract
Recently, due to the rapid development and remarkable results of deep learning (DL) and machine learning (ML) approaches in various domains for several long-standing artificial intelligence (AI) tasks, there has been strong interest in applying them to network security as well. Nowadays, in the information communication technology (ICT) era, the intrusion detection (ID) system has great potential to be the frontier of security against cyberattacks and plays a vital role in protecting network infrastructure and resources. Conventional ID systems are not strong enough to detect advanced malicious threats. Heterogeneity is one of the important features of big data. Thus, designing an efficient ID system using a heterogeneous dataset is a massive research problem. Several ID datasets are openly available for further research by the cybersecurity community. However, no existing research has shown a detailed performance evaluation of several ML methods on various publicly available ID datasets. Due to the dynamic nature of malicious attacks with continuously changing attack detection methods, ID datasets are available publicly and are updated systematically. In this research, robust classical ML classifiers based on Spark MLlib (machine learning library) for anomaly detection, and state-of-the-art DL, such as the convolutional autoencoder (Conv-AE) for misuse attacks, are used to develop an efficient and intelligent ID system to detect and classify unpredictable malicious attacks. To measure the effectiveness of our proposed ID system, we have used several important performance metrics, such as FAR, DR, and accuracy, while experiments are conducted on a publicly available dataset, specifically the contemporary heterogeneous CSE-CIC-IDS2018 dataset. Full article
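
The kind of evaluation described, a Spark MLlib classifier scored with accuracy, detection rate (DR), and false alarm rate (FAR), can be sketched as follows; the dataset path, columns, and the choice of logistic regression are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.mllib.evaluation import MulticlassMetrics

spark = SparkSession.builder.appName("ids-classifier-evaluation").getOrCreate()

# Hypothetical pre-vectorized dataset: "features" vector, binary "label" (1 = attack).
df = spark.read.parquet("/data/ids2018/features.parquet")
train, test = df.randomSplit([0.8, 0.2], seed=3)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
pred = model.transform(test).select("prediction", "label")

metrics = MulticlassMetrics(pred.rdd.map(lambda r: (float(r.prediction), float(r.label))))
cm = metrics.confusionMatrix().toArray()   # rows = actual class, columns = predicted class
tn, fp, fn, tp = cm[0][0], cm[0][1], cm[1][0], cm[1][1]

accuracy = (tp + tn) / cm.sum()
dr = tp / (tp + fn)    # detection rate (recall on the attack class)
far = fp / (fp + tn)   # false alarm rate
print(accuracy, dr, far)
```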

4 pages, 170 KiB  
Proceeding Paper
Network Anomaly Detection Using Machine Learning Techniques
by Julio J. Estévez-Pereira, Diego Fernández and Francisco J. Novoa
Proceedings 2020, 54(1), 8; https://doi.org/10.3390/proceedings2020054008 - 19 Aug 2020
Cited by 4 | Viewed by 6863
Abstract
While traditional network security methods have been proven useful until now, the flexibility of machine learning techniques makes them a solid candidate in the current landscape of our networks. In this paper, we assess how well the latter are capable of detecting security threats in a corporate network. To that end, we configure and compare several models to find the one which best fits our needs. Furthermore, we distribute the computational load and storage so we can handle extensive volumes of data. The algorithms that we use to create our models, namely Random Forest, Naive Bayes, and Deep Neural Networks (DNN), are both divergent and tested in other papers, which makes our comparison richer. For the distribution phase, we operate with Apache Structured Streaming, PySpark, and MLlib. As for the results, it is relevant to mention that our dataset has been found to be effectively modelable with just a reduced number of features. Finally, given the outcomes obtained, we find this line of research encouraging and, therefore, this approach worth pursuing. Full article
(This article belongs to the Proceedings of 3rd XoveTIC Conference)
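
A minimal sketch of how such a model comparison could look with PySpark's MLlib on a pre-vectorized flow dataset is given below; the classifiers follow the abstract, while the data path, columns, and F1 metric are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier, NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("nad-model-comparison").getOrCreate()

# Hypothetical flow records already reduced to a small feature vector.
df = spark.read.parquet("/data/netflows/features.parquet")
train, test = df.randomSplit([0.8, 0.2], seed=11)

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
for name, clf in [
    ("RandomForest", RandomForestClassifier(labelCol="label", numTrees=100)),
    ("NaiveBayes", NaiveBayes(labelCol="label")),  # requires non-negative feature values
]:
    score = evaluator.evaluate(clf.fit(train).transform(test))
    print(f"{name}: f1 = {score:.3f}")
```
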
24 pages, 552 KiB  
Article
Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark
by Athanasios Alexopoulos, Georgios Drakopoulos, Andreas Kanavos, Phivos Mylonas and Gerasimos Vonitsanos
Algorithms 2020, 13(3), 71; https://doi.org/10.3390/a13030071 - 24 Mar 2020
Cited by 18 | Viewed by 5366
Abstract
At the dawn of the 10V or big data era, there are a considerable number of sources such as smartphones, IoT devices, social media, smart city sensors, as well as the health care system, all of which constitute but a small portion of the data lakes feeding the entire big data ecosystem. This 10V data growth poses two primary challenges, namely storing and processing. Concerning the latter, new frameworks have been developed, including distributed platforms such as the Hadoop ecosystem. Classification is a major machine learning task typically executed on distributed platforms, and as a consequence many algorithmic techniques have been developed tailored for these platforms. This article relies extensively, in two ways, on classifiers implemented in MLlib, the main machine learning library for the Hadoop ecosystem. First, a vast number of classifiers are applied to two datasets, namely Higgs and PAMAP. Second, a two-step classification is performed ab ovo on the same datasets. Specifically, the singular value decomposition of the data matrix first determines a set of transformed attributes which in turn drive the classifiers of MLlib. The twofold purpose of the proposed architecture is to reduce complexity while maintaining a similar if not better level of accuracy, recall, and F1. The intuition behind this approach stems from the engineering principle of breaking down complex problems into simpler and more manageable tasks. The experiments based on the same Spark cluster indicate that the proposed architecture outperforms the individual classifiers with respect to both complexity and the abovementioned metrics. Full article
(This article belongs to the Special Issue Mining Humanistic Data 2019)
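
A minimal sketch of the SVD-preprocessing step, using Spark's RDD-based distributed linear algebra, is shown below: compute a truncated SVD of the feature matrix and take the left singular vectors, scaled by the singular values, as the transformed attributes that would then drive the MLlib classifiers. The tiny in-line data and the value of k are placeholders, not the paper's configuration.

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("svd-preprocessing").getOrCreate()
sc = spark.sparkContext

# Hypothetical raw feature vectors (one per record); real datasets such as
# Higgs or PAMAP would be loaded from storage instead.
rows = sc.parallelize([
    Vectors.dense([1.0, 2.0, 3.0, 4.0]),
    Vectors.dense([2.0, 0.0, 1.0, 3.0]),
    Vectors.dense([4.0, 1.0, 0.0, 2.0]),
])
mat = RowMatrix(rows)

# Truncated SVD: keep the k leading singular triplets.
k = 2
svd = mat.computeSVD(k, computeU=True)
s = svd.s  # singular values as a local vector, safe to capture in a closure

# Each row of U, scaled by the singular values, is the low-dimensional
# representation of the corresponding record.
reduced = svd.U.rows.map(lambda u: [u[i] * s[i] for i in range(k)])
print(reduced.collect())
```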

13 pages, 671 KiB  
Article
Towards Near-Real-Time Intrusion Detection for IoT Devices using Supervised Learning and Apache Spark
by Valerio Morfino and Salvatore Rampone
Electronics 2020, 9(3), 444; https://doi.org/10.3390/electronics9030444 - 6 Mar 2020
Cited by 61 | Viewed by 6349
Abstract
In the fields of Internet of Things (IoT) infrastructures, attack and anomaly detection are rising concerns. With the increased use of IoT infrastructure in every domain, threats and attacks in these infrastructures are also growing proportionally. In this paper, the performance of several machine learning algorithms in identifying cyber-attacks (namely, SYN-DOS attacks) against IoT systems is compared, both in terms of application performance and of training/application times. We use supervised machine learning algorithms included in the MLlib library of Apache Spark, a fast and general engine for big data processing. We show the implementation details and the performance of those algorithms on public datasets using a training set of up to 2 million instances. We adopt a Cloud environment, emphasizing the importance of scalability and elasticity of use. Results show that all the Spark algorithms used achieve a very good identification accuracy (>99%). Overall, one of them, Random Forest, achieves an accuracy of 1. We also report a very short training time (23.22 sec for Decision Tree with 2 million rows). The experiments also show a very low application time (0.13 sec for more than 600,000 instances with Random Forest) using Apache Spark in the Cloud. Furthermore, the explicit model generated by Random Forest is very easy to implement using high- or low-level programming languages. In light of the results obtained, both in terms of computation times and identification performance, a hybrid approach for the detection of SYN-DOS cyber-attacks on IoT devices is proposed: the application of an explicit Random Forest model, implemented directly on the IoT device, along with a second-level analysis (training) performed in the Cloud. Full article
(This article belongs to the Special Issue Recent Machine Learning Applications to Internet of Things (IoT))
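
The hybrid approach hinges on exporting the trained Random Forest as explicit rules that can be re-implemented on the device. The following sketch (hypothetical data and hyperparameters, not the authors' code) trains a small MLlib Random Forest and prints its decision rules via toDebugString.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("syn-dos-rf-export").getOrCreate()

# Hypothetical labeled packet features (label 1 = SYN-DOS).
train = spark.read.parquet("/data/syn_dos/train_features.parquet")

rf = RandomForestClassifier(
    labelCol="label", featuresCol="features", numTrees=10, maxDepth=5
)
model = rf.fit(train)

# The explicit if/else structure of every tree; these rules could be hand-translated
# (or code-generated) into the firmware of a resource-constrained IoT device.
print(model.toDebugString)
```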

13 pages, 925 KiB  
Article
Comparative Study between Big Data Analysis Techniques in Intrusion Detection
by Mounir Hafsa and Farah Jemili
Big Data Cogn. Comput. 2019, 3(1), 1; https://doi.org/10.3390/bdcc3010001 - 20 Dec 2018
Cited by 31 | Viewed by 6433
Abstract
Cybersecurity Ventures expects that cyber-attack damage costs will rise to $11.5 billion in 2019 and that a business will fall victim to a cyber-attack every 14 seconds. Notice here that the time frame for such an event is seconds. With petabytes of data generated each day, this is a challenging task for traditional intrusion detection systems (IDSs). Protecting sensitive information is a major concern for both businesses and governments. Therefore, the need for a real-time, large-scale and effective IDS is a must. In this work, we present a cloud-based, fault-tolerant, scalable and distributed IDS that uses Apache Spark Structured Streaming and its Machine Learning library (MLlib) to detect intrusions in real time. To demonstrate the efficacy and effectiveness of this system, we implement the proposed system within Microsoft Azure Cloud, as it provides both processing power and storage capabilities. A decision tree algorithm is used to predict the nature of incoming data. For this task, the use of the MAWILab dataset as a data source gives better insights into the system’s capabilities against cyber-attacks. The experimental results showed 99.95% accuracy, with more than 55,175 events per second processed by the proposed system on a small cluster. Full article
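
A minimal sketch of the real-time scoring pattern described follows: a decision-tree pipeline trained offline with MLlib is applied to a Structured Streaming source. The Kafka topic, schema, and model path are hypothetical, not the authors' deployment.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("streaming-ids-scoring").getOrCreate()

# Pipeline (feature assembly + DecisionTreeClassifier) assumed trained and saved offline.
model = PipelineModel.load("/models/ids/decision_tree_pipeline")

schema = StructType([
    StructField("duration", DoubleType()),
    StructField("src_bytes", DoubleType()),
    StructField("dst_bytes", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "network-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# PipelineModel.transform works on streaming DataFrames, so each micro-batch
# is scored as it arrives.
query = (
    model.transform(events)
    .select("duration", "src_bytes", "dst_bytes", "prediction")
    .writeStream.outputMode("append").format("console").start()
)
query.awaitTermination()
```
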
20 pages, 1619 KiB  
Article
A Two-Stage Big Data Analytics Framework with Real World Applications Using Spark Machine Learning and Long Short-Term Memory Network
by Muhammad Ashfaq Khan, Md. Rezaul Karim and Yangwoo Kim
Symmetry 2018, 10(10), 485; https://doi.org/10.3390/sym10100485 - 11 Oct 2018
Cited by 47 | Viewed by 7235
Abstract
Every day we experience unprecedented data growth from numerous sources, which contribute to big data in terms of volume, velocity, and variability. These datasets again impose great challenges on analytics frameworks and computational resources, making the overall analysis difficult for extracting meaningful information in a timely manner. Thus, to tackle these kinds of challenges, developing an efficient big data analytics framework is an important research topic. Consequently, to address these challenges by exploiting non-linear relationships from very large and high-dimensional datasets, machine learning (ML) and deep learning (DL) algorithms are being used in analytics frameworks. Apache Spark has been in use as a fast big data processing engine that helps to solve iterative ML tasks using its distributed ML library, Spark MLlib. Considering real-world research problems, DL architectures such as the Long Short-Term Memory (LSTM) network are an effective approach to overcoming practical issues such as reduced accuracy, long-term sequence dependencies, and vanishing and exploding gradients in conventional deep architectures. In this paper, we propose an efficient analytics framework, which is technically a progressive machine learning technique merged with Spark-based linear models, a Multilayer Perceptron (MLP), and an LSTM, using a two-stage cascade structure in order to enhance predictive accuracy. Our proposed architecture enables us to organize big data analytics in a scalable and efficient way. To show the effectiveness of our framework, we applied the cascading structure to two different real-life datasets to solve a multiclass and a binary classification problem, respectively. Experimental results show that our analytical framework outperforms state-of-the-art approaches with a high level of classification accuracy. Full article
(This article belongs to the Special Issue Information Technology and Its Applications 2021)
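
The Spark side of the described cascade can be sketched roughly as follows: an MLlib Multilayer Perceptron trained as a first stage whose outputs would feed the downstream LSTM. Layer sizes, data path, and columns are illustrative assumptions, not the authors' configuration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("cascade-stage1-mlp").getOrCreate()

# Hypothetical pre-vectorized dataset with a "features" vector and integer "label".
df = spark.read.parquet("/data/cascade/stage1_features.parquet")
train, test = df.randomSplit([0.8, 0.2], seed=21)

# Layer sizes: input dimension, two hidden layers, number of classes (illustrative).
mlp = MultilayerPerceptronClassifier(layers=[40, 64, 32, 5], maxIter=100, seed=21)
model = mlp.fit(train)

acc = MulticlassClassificationEvaluator(metricName="accuracy").evaluate(model.transform(test))
print(f"Stage-1 MLP accuracy: {acc:.3f}")
# Stage-1 predictions (or class probabilities) would then be passed to the LSTM stage.
```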

21 pages, 343 KiB  
Article
Large Scale Implementations for Twitter Sentiment Classification
by Andreas Kanavos, Nikolaos Nodarakis, Spyros Sioutas, Athanasios Tsakalidis, Dimitrios Tsolis and Giannis Tzimas
Algorithms 2017, 10(1), 33; https://doi.org/10.3390/a10010033 - 4 Mar 2017
Cited by 58 | Viewed by 7092
Abstract
Sentiment Analysis on Twitter Data is indeed a challenging problem due to the nature, diversity and volume of the data. People tend to express their feelings freely, which makes Twitter an ideal source for accumulating a vast amount of opinions towards a wide spectrum of topics. This amount of information offers huge potential and can be harnessed to gauge the sentiment tendency towards these topics. However, since no one can invest an infinite amount of time to read through these tweets, an automated decision making approach is necessary. Nevertheless, most existing solutions are limited to centralized environments only. Thus, they can only process at most a few thousand tweets. Such a sample is not representative enough to define the sentiment polarity towards a topic, given the massive number of tweets published daily. In this work, we develop two systems: the first in the MapReduce framework and the second in the Apache Spark framework for programming with Big Data. The algorithm exploits all hashtags and emoticons inside a tweet as sentiment labels and performs classification of diverse sentiment types in a parallel and distributed manner. Moreover, the sentiment analysis tool is based on Machine Learning methodologies alongside Natural Language Processing techniques and utilizes Apache Spark’s Machine Learning library, MLlib. In order to address the nature of Big Data, we introduce some pre-processing steps for achieving better results in Sentiment Analysis, as well as Bloom filters to compact the storage size of intermediate data and boost the performance of our algorithm. Finally, the proposed system was trained and validated with real data crawled from Twitter, and, through an extensive experimental evaluation, we prove that our solution is efficient, robust and scalable while confirming the quality of our sentiment identification. Full article
(This article belongs to the Special Issue Humanistic Data Processing)
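
A minimal PySpark sketch of the distributed classification step follows: tweets labeled from their hashtags or emoticons feed a tokenizer, TF-IDF features, and an MLlib classifier. The in-line tweets, labeling, and the choice of Logistic Regression are illustrative assumptions, not the paper's exact configuration.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("twitter-sentiment-mllib").getOrCreate()

# Hypothetical tweets already assigned a sentiment label from hashtags/emoticons.
tweets = spark.createDataFrame(
    [
        ("what a great day #happy", 1.0),
        ("this is awful #fail", 0.0),
        ("loving the new release :)", 1.0),
    ],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 18),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(maxIter=20),
])
model = pipeline.fit(tweets)
model.transform(tweets).select("text", "prediction").show(truncate=False)
```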
