Big Data Mining and Classification of Intelligent Material Science Data Using Machine Learning

There is a high need in the material science community for a big data repository of material compositions and their derived analytics of metal strength. Currently, many researchers maintain their own Excel sheets, prepared manually by their teams by tabulating experimental data collected from scientific journals, and analyze the data by performing manual calculations using formulas to determine the strength of the material. In this study, we propose a big data store for material science data and its processing-parameter information to address the laborious process of data tabulation from scientific articles, data mining techniques to retrieve the information from databases to perform big data analytics, and a machine learning prediction model to determine material strength insights. Three models are proposed, based on the Logistic regression, Support Vector Machine (SVM) and Random Forest algorithms. These models are trained and tested using a 10-fold cross-validation approach. The Random Forest classification model performed best on the independent dataset, with 87% accuracy, in comparison to Logistic regression and SVM with 72% and 78%, respectively.


Introduction
Our study is based on intelligent material science data for Magnesium alloy (Mg-alloy), which is engineered to be a lightweight alloy without compromising the strength of the material and is widely used across multiple domains, including the United States Armed Forces [1][2][3]. The Mg-alloy data is composed of different metal compositions and their processing parameter information, involving techniques such as casting, extrusion, rolling, forging [4][5][6][7][8][9][10][11], etc. Currently, there is no existing data repository for storing Mg-alloy data. In this study, we propose a big data repository that archives the Mg-alloy data retrieved from scientific journals, on which big data analytics and predictive modeling can be performed. The Mg-alloy tests are classified based on tensile properties, including the strength and the ductility of the metals. Based on these tensile properties, we propose a prediction model that will help in the future to determine the boundary conditions for a strong Mg-alloy material for the specific purposes of an application.
We will focus on how to address this material science research problem by creating a big data repository and predicting the strength of the material by applying Machine Learning (ML) models to the big data retrieved from the repository. We have applied MongoDB for management of the Mg-alloy big database and a logistic regression algorithm for its binary classification [12][13][14]. MongoDB, a big data tool, was chosen for data management of the Mg-alloy, as NoSQL (non-SQL, non-Structured Query Language or non-relational) databases like MongoDB are often better suited for storing and modeling semi-structured Mg-alloy data [15][16][17]. The Mg-alloy data is stored as documents in MongoDB, a popular NoSQL database [13]. The documents are queried from the Mongo collection to retrieve the tensile properties of the Mg-alloy, as these properties will help us determine the strength of the metals.
Tensile testing is a mechanical testing method through which tensile properties such as tensile yield strength (YS), ultimate tensile strength (UTS), and elongation-at-fracture (EL) of the material under testing can be obtained. These three tensile properties are independent of each other. The ductility of the metal depends on the tensile elongation-at-fracture (EL): the higher the elongation, the higher the ductility, and vice versa. Our goal is to find a statistical correlation between these three independent variables using machine learning (ML) predictions. Using ML techniques, the ductility (a physical property) of the Mg-alloy is determined by considering the tensile yield strength (YS) and ultimate tensile strength (UTS) as independent variables.
Mg-alloy can form a strong or weak material based on the combination of its metal composition, its processing methods/parameters and its tensile properties. In our methodology we consider only tensile properties, due to the sparse nature of the dataset in the variables for metal compositions and processing methods/parameters. Strong Mg-alloy is ductile metal and weak Mg-alloy is brittle metal; an Mg-alloy can belong to either of these two classes. To address this classification problem, there are many popular Machine Learning classification algorithms, such as Support Vector Machine (SVM), Linear regression, K-nearest neighbor, Logistic regression, Random Forest, etc. [18][19][20][21][22][23][24]. However, choosing the best algorithm to solve a given problem depends on a number of factors, namely the accuracy of the model, dataset formats, number of parameters, features, etc. Based on these factors, the nature of the problem, the presence of a sparse dataset and the number of dependent and independent variables, the Logistic regression algorithm was chosen for Mg-alloy classification.
The classification algorithms Support Vector Machine (SVM) and Random Forest are also applied on the independent variables to predict the output variable, the ductility of the metal, and the classification metrics of all three algorithms are compared.
We trained our model on the extracted dataset of 128 data rows, tested it on an independent dataset of 33 data rows, and observed that we achieved accuracies greater than 70% for all three machine learning classification models: Logistic regression, SVM and Random Forest. The accuracies using 10-fold cross-validation on the training dataset of 128 data rows for Logistic regression, Support Vector Machine (SVM) and Random Forest are 70.25%, 75.76% and 72.69%, respectively, and the accuracies when tested on the independent test data of 33 rows were 72.72%, 78.78% and 87.87%, respectively.

Literature Review
The Mg-alloy is used in a variety of applications due to its abundant availability [2,3,[25][26][27][28][29][30][31][32]. It has numerous advantages when compared to other metals. Mg-alloy is a light-weight metal with high ductility when processed using the right techniques at appropriate conditions [1]. Because of its numerous applications and benefits, there are many ongoing research studies in the material science field to determine the composition of different metals to form a high-quality Mg-alloy.
The Magnesium-alloy data is collected manually in tabulated logbooks from scientific literature review and laboratory experiments [33][34][35][36]. Hence, there is a high need to develop big material science data repositories to automate the identification of strong Mg-alloy metal based on material characterization and process parameters/methods. This leads to our proposal of a big data repository for the Mg-alloy data. Unless there is a big data repository, we will not be able to apply automatic machine learning (ML) predictions to determine the strength of an Mg-alloy product. The selection of the database is based on different factors under consideration, such as domain, structured versus unstructured data, relations between data elements, data size, complexity, cost, scalability, and data analytics tools.
As the Mg-alloy data is semi-structured, traditional relational database management systems (RDBMS), i.e., SQL (Structured Query Language or relational) databases, may lead to horizontal scalability issues for storing dynamically changing semi-structured data [15]. An RDBMS cannot store or process big data efficiently, and the emergence of NoSQL (non-SQL or non-relational) databases has become the popular solution to overcome this issue [37]. NoSQL databases are better suited for semi-structured or unstructured data and outperform RDBMS for such workloads [38]. When exponentially increasing big data was stored on an RDBMS, it was observed that the performance of the database was very low while applying data extraction techniques, due to SQL query execution times being slower for these operations [37]. These performance issues can be addressed with NoSQL databases by expanding horizontal scalability [39]. To overcome these efficacy problems, different types of NoSQL databases emerged, one of which is the popular MongoDB, which supports a document data model [40]. In a recent study, Electronic Health Records (EHR) data, which is semi-structured big data, was managed and analyzed using MongoDB [15]. Similarly, the Mg-alloy data is semi-structured and can be efficiently stored in a NoSQL database.
Mg-alloy data is semi-structured big data that will grow at an ever-increasing rate once the data collection process is automated and a data repository is created. Thus, we chose MongoDB as the big data repository for the Mg-alloy data, as it is suitable for semi-structured data and has a cost of one-tenth when compared to an RDBMS [15]. The possible data models for NoSQL databases are key-value, column-oriented or document databases [41]. The data in MongoDB is stored as documents, and each document contains key-value pairs to hold the data [13,41]. MongoDB also supports embedded data models, which reduce I/O operations [13]. It provides redundant data, or replica sets, distributed across the cluster [15]. In a related study, 112 datasets were built from a larger body of literature to collect different metal compositions of Zn, Al, Mn, Ca, Si and their processing parameter information; only rolled and extruded alloys were used to build the dataset, by including the attributes related to extrusion [42].
In another logistic regression study, the authors integrated the conjugate gradient method into the logistic regression model to address the binary classification of the customer churn problem for an airline business [43]. A machine learning based spatio-temporal data mining approach was used to detect HAB (Harmful Algal Bloom) events in the region of the Gulf of Mexico [20,23]. Here, the authors used a kernel-based support vector machine as a classifier for detecting HABs and also for predicting them within a window of 7 days [20,23]. In one recent study, Logistic regression was used not only to predict football match results, but also to determine the significant variables that contribute to winning a match [12]. In [44], an ML-based random forest classifier is used for fault classification in power systems networks, and a comparative study of ML techniques, such as random forest, SVM, KNN and Logistic regression, is explored for fault detection.
ML models can also be used for detecting cyber-attacks. It is equally important for the ML model itself to be robust against adversarial examples, to avoid misclassification or incorrect prediction. An adversarial example is a sample created by adding a little noise to the original sample data; although it presents no change identifiable to human perception, it will be misclassified by a deep neural network [45]. In recent studies, machine learning security against adversarial examples has been implemented for textual classification [45][46][47][48]. Syntactically controlled paraphrase networks (SCPN) were developed to generate adversarial examples for both sentiment analysis and textual entailment [46]. These examples are introduced into the training data to increase the robustness of the models to syntactic variation.
In [47], the authors discovered that an ensemble of weak defenses is not sufficient to provide a strong defense against adversarial examples. A friend-guard adversarial example is created that will be correctly classified by the friend model and misclassified by the enemy model, without introducing any changes in meaning or grammar that would be perceived by humans [47]. Machine learning security is used not only for generating adversarial examples but also to detect backdoor attacks against multiple deep neural networks [49].
In one previous Mg-alloy study, the popular machine learning models Support Vector Machine (SVM) and Artificial Neural Network (ANN) were used to predict the mechanical properties of the Mg-alloy [18][19][20][21][22][23][24]42]. The tensile properties YS, UTS and elongation-at-fracture (EL) are the three outputs of the output layer [42]. We use the mechanical properties of the alloy, YS and UTS, as the input variables to predict the output variable ductility, which depends on tensile elongation-at-fracture (EL). We have applied machine learning predictions of three popular classification algorithms, Logistic regression, Support Vector Machine (SVM) and Random Forest, on the Mg-alloy data, and our contributions in this study are:
• Created a big data storage for material science data.

• Developed data mining techniques to retrieve the required data from a database to perform data analytics.
• Developed a machine learning prediction model to determine the strength of metals.

Methodology
The methodology includes three steps of implementation, namely Data Management, Big Data Mining and Data Classification. Each step is discussed in detail below.

Data Management
Listed below are the steps that we followed for archival of the Mg-alloy big data onto MongoDB. These steps cover the process from data collection through data storage on MongoDB.
1. Data Collection
2. Data Preprocessing
3. Data Model Design
4. Data Conversion Model
5. Data Transfer Model
Here, steps 1 and 2 are done manually for collecting and preprocessing the data, and detailed information is discussed in Sections 4.1 and 4.3, respectively. For steps 3 and 4, we developed code using the Python programming language. This program includes both the functionality of grouping the data as per the chosen schema design, and conversion of the input CSV (Comma Separated Value) file format to a JSON (JavaScript Object Notation) file format. The technical information of steps 3 and 4 is discussed in detail in Sections 4.2 and 4.4, respectively.
We observed that the Python program developed jointly for steps 3 and 4 took execution times ranging from 48 to 68 ms over multiple attempts to group the input data of 218 rows as per the data model design and convert it into JSON file format, resulting in 143 documents with an embedded document structure. For step 5, we used Studio 3T, a Mongo GUI (Graphical User Interface) tool, to import the data onto the database. When we tested the data import functionality with multiple execution attempts, the execution times ranged from 83 to 152 ms for importing 143 documents onto the database.
The statistics for the execution times in steps 3 to 5 are listed in Table 1. The data mining technique includes the two steps below. These steps are implemented for the retrieval of Mg-alloy tensile properties from the database in JSON file format, which is then parsed to a CSV file format.

1. Query documents to retrieve tensile properties of metals, resulting in JSON file format.
2. Convert the JSON to CSV file format and prepare the dataset to feed the machine learning model.
We developed two programs in the Python programming language for the above two steps. The technical details are discussed in Section 5. The two programs are:

• Program-1 retrieves all existing tensile values from the database and parses the result to CSV file format to make it machine learning ready.
• Program-2 retrieves all numerical tensile values from the database and parses the result to CSV file format to make it machine learning ready.
When Programs 1 and 2 were run multiple times, we noticed that the execution times for the combined functionality of data retrieval from MongoDB and parsing the result to CSV file format ranged from 270 to 554 ms.
The statistics for the execution times of Programs 1 and 2 over multiple execution attempts are listed in Table 2.

Data Classification
The Logistic regression model is built with the tensile properties of metals as input and output variables. Tensile Yield Strength (YS) and Ultimate Tensile Strength (UTS) are used as input variables to predict the output variable ductility, which depends on Elongation-at-fracture (EL).


Data Collection
The Mg-alloy data was collected from scientific articles through the literature review [33][34][35][36]. The retrieved data was tabulated in a semi-structured CSV file format. The information in the collected data includes the details of the author and publisher, followed by the Mg-alloy details. The Mg-alloy details include the metal compositions, their processing parameters, and their mechanical and corrosion properties. The processing information, when further drilled down, includes the details of the casting, extrusion and rolling temperatures of the alloy, and the mechanical properties include the tensile and compression properties of the Mg-alloy. The data collected from each scientific article may have multiple combinations of metal compositions, processing, mechanical and corrosion properties to form a magnesium-alloy. Each material composition combination is written as a single row of data in the CSV file. Multiple scenarios exist where multiple rows of data belong to one single scientific journal.
The raw data was cleaned and preprocessed manually. This cleaned data was then exported to the database. As the clean data was in a semi-structured CSV file, the format needed to be converted to JSON so that it could be imported onto the NoSQL database.

Data Model Design
MongoDB supports two types of data model design: the embedded data model and the normalized data model [13]. The embedded data model is further classified as a one-to-one relationship embedded document or a one-to-many relationship embedded document. In general, a normalized data design is used for complex many-to-many relationships between the connected data. So, the decision of whether to choose the embedded model or the normalized model depends on how the data is connected and what kind of relation is formed between the data. When we analyzed the Mg-alloy data manually, we observed that for most of the articles there are multiple data rows related to each scientific article, and only a few data rows hold a one-to-one relation. So, with mostly one-to-many relations between the data, we decided to choose the one-to-many embedded model design for the Mg-alloy data. The reason for choosing the embedded model is that, in each row of the Mg-alloy data, the material compositions and mechanical tensile properties vary widely in different ranges, whereas the processing and compression property details remain the same for most of the row-level data and can be grouped together.
With this observation, the schema design chosen for the intelligent data is the embedded data model, because of the possibility of data grouping. The data storage is done using nested documents. By using this schema design, we can store 217 rows of data as 143 documents. Each document is stored as a key-value pair. The key of each document is the unique Id, and the Value holds 39 fields; one of the fields, named "Metal Properties", holds the embedded documents. These embedded documents may have multiple elements forming sub-embedded documents, and each element has 30 fields internally. If there are multiple data rows related to one single article, then the "Metal Properties" field holds multiple elements, and each element in turn holds 30 fields. The embedded and sub-embedded documents also hold the data in the form of key-value pairs. Figure 2 shows the data model design for the Mg-alloy data.
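The embedded structure described above can be sketched as a nested JSON document. The field names and values below are illustrative assumptions, not the exact schema of the authors' collection:

```python
import json

# Hypothetical example: one main document per scientific article, with a
# "Metal Properties" field holding embedded documents -- one element per
# Mg-alloy data row reported in that article.
sample_document = {
    "_id": "article-001",                 # unique Id (the key of the document)
    "Author": "Doe et al.",
    "Publication Source": "https://example.org/article-001",
    "Processing Method": "Extrusion",
    "Extrusion Temperature (C)": 350,
    "Metal Properties": [                 # embedded documents
        {"Al (%)": 3.0, "Zn (%)": 1.0, "YS (MPa)": 190, "UTS (MPa)": 270, "EL (%)": 14.0},
        {"Al (%)": 6.0, "Zn (%)": 1.0, "YS (MPa)": 205, "UTS (MPa)": 290, "EL (%)": 9.5},
    ],
}

# Two data rows from one article are stored in a single main document; each
# element of "Metal Properties" is itself a key-value sub-document.
print(json.dumps(sample_document, indent=2))
```

Grouping the per-article metadata once per main document is what allows 217 CSV rows to collapse into 143 documents.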

Data Preprocessing
The raw data is cleaned manually to remove the merged cells of rows and columns in the CSV file. Special characters are removed from the data, and numerical missing data in the Mg-alloy parameters is handled using NaNs (Not a Number), as the empty cells do not add any information to the actual Mg-alloy data: only a few metal compositions are combined to form an Mg-alloy, and the rest of the metal composition percentages are not present. If a scientific article has multiple rows of data, then the information related to the author, title, nationality of the author, institute of the author, and publication source link is repeated in the first five columns of the CSV file during this phase. This redundancy is eliminated by grouping these columns, combined with the processing and corrosion properties, in the embedded document while converting the CSV file to JSON file format.
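A minimal sketch of the per-cell cleaning described above, using only the standard library. The exact set of special characters stripped from the real dataset is an assumption made for illustration:

```python
import math
import re

def clean_cell(raw: str):
    """Strip stray special characters and map empty cells to NaN.

    A simplified stand-in for the manual cleaning step; the character
    set removed here is an illustrative assumption.
    """
    text = re.sub(r"[^\w.\-% ]", "", raw).strip()
    if text == "":
        return math.nan      # empty composition cell carries no information
    try:
        return float(text)   # numerical Mg-alloy parameter
    except ValueError:
        return text          # non-numeric metadata (author, title, ...)

row = ["  210*", "", "Zn", "12.5"]
cleaned = [clean_cell(c) for c in row]
print(cleaned)
```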

Data Conversion Model
The data in MongoDB can be stored in the form of JSON objects [13]. The semi-structured preprocessed data is converted from CSV to JSON file format and is stored as nested documents in the database. The details of the scientific articles, such as author, title, nationality of the author, institute of the author, and publication source link, are grouped together with the processing and corrosion properties of the Mg-alloy, which belong to the main document. Each Mg-alloy has a unique metal composition and mechanical properties; these two properties are grouped together to form embedded documents in the main document. If an article has multiple Mg-alloy metal combinations, then we have multiple embedded documents in the main document; embedded documents in turn may contain multiple sub-embedded documents. All the main documents combined form a collection. This collection is a JSON file that is ready to be imported into MongoDB. We collected 217 rows of data and transformed them into 143 documents in the JSON file format.
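The grouping step of this conversion can be sketched as follows. The column names and rows are hypothetical placeholders for the real dataset's layout:

```python
import json
from itertools import groupby

# Hypothetical column layout: the first columns identify the article, the
# rest hold the per-alloy mechanical properties.
ARTICLE_COLS = ["Author", "Title"]
ALLOY_COLS = ["YS (MPa)", "UTS (MPa)", "EL (%)"]

rows = [  # stand-in for the preprocessed CSV rows
    {"Author": "Doe", "Title": "A", "YS (MPa)": "190", "UTS (MPa)": "270", "EL (%)": "14"},
    {"Author": "Doe", "Title": "A", "YS (MPa)": "205", "UTS (MPa)": "290", "EL (%)": "9.5"},
    {"Author": "Lee", "Title": "B", "YS (MPa)": "150", "UTS (MPa)": "230", "EL (%)": "21"},
]

def rows_to_documents(rows):
    """Group consecutive CSV rows of one article into one main document whose
    "Metal Properties" field embeds one sub-document per alloy row."""
    documents = []
    for key, group in groupby(rows, key=lambda r: tuple(r[c] for c in ARTICLE_COLS)):
        doc = {c: v for c, v in zip(ARTICLE_COLS, key)}
        doc["Metal Properties"] = [{c: r[c] for c in ALLOY_COLS} for r in group]
        documents.append(doc)
    return documents

docs = rows_to_documents(rows)
print(json.dumps(docs, indent=2))  # three rows collapse into two documents
```

The same grouping idea, applied to the full 217-row dataset, is what yields the 143 importable documents.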


Data Transfer Model
We used a standalone MongoDB cluster on the local machine by downloading and installing MongoDB as per the documentation provided on the official MongoDB website [13]. We can start the server on the machine and establish the connection using the mongo shell from the command prompt, or using the MongoDB Compass GUI tool by providing the connection string. Once the connection is successful, we create a database and name the collection. We then browse to the JSON file, i.e., the collection saved on the machine, and add it to the database, and the data is imported onto the NoSQL database. We can add, update, or delete documents using any Mongo GUI tool, such as MongoDB Compass or Studio 3T. We used the Studio 3T GUI for all database operations on MongoDB, imported the JSON file, and successfully imported 143 documents onto the database.
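As a programmatic alternative to the GUI import, the same transfer could be scripted with PyMongo. The connection string, database and collection names, and file path below are assumptions for a local standalone setup, not the authors' configuration:

```python
import json

def load_collection(json_path):
    """Read the converted JSON collection file (a list of main documents)."""
    with open(json_path, encoding="utf-8") as fh:
        return json.load(fh)

if __name__ == "__main__":
    # Requires a running local MongoDB server and `pip install pymongo`.
    from pymongo import MongoClient
    client = MongoClient("mongodb://localhost:27017/")
    collection = client["mg_alloy_db"]["alloys"]       # assumed names
    docs = load_collection("mg_alloy_collection.json")  # assumed path
    result = collection.insert_many(docs)
    print(f"Imported {len(result.inserted_ids)} documents")
```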

Big Data Mining
There are 39 fields in each document, including the unique Id of the document. One of the fields is "Metal Properties", which forms the nested embedded document in the main document. This embedded document can hold n elements, and each element holds 30 more fields, which include the information on metal compositions and their mechanical and corrosion properties. Our focus is on extracting the mechanical tensile properties of the metals, as these properties are the deciding factor for the ductility of the Mg-alloy. To retrieve these values from the database, first establish a connection through the MongoClient function, which can be imported from the PyMongo library in a Python script. Once the connection is successfully established, specify the database and collection names to access the data from the collection. Based on the field names, build a query and apply projections to retrieve the required data. Below are the two sample queries that we used to extract the mechanical tensile properties of the alloy. The result is a JSON object.
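The connect-then-query flow described above can be sketched as follows. The field path, database and collection names are illustrative assumptions:

```python
# Hypothetical dotted path into the embedded "Metal Properties" documents;
# the real schema's field names may differ.
FIELD = "Metal Properties.UTS (MPa)"

query = {FIELD: {"$exists": True}}              # filter on the embedded field
projection = {"Metal Properties": 1, "_id": 0}  # include only what we need

if __name__ == "__main__":
    # Requires a running local MongoDB server and `pip install pymongo`.
    from pymongo import MongoClient
    client = MongoClient("mongodb://localhost:27017/")
    collection = client["mg_alloy_db"]["alloys"]  # assumed names
    for doc in collection.find(query, projection):
        print(doc)                                # each result is a JSON-like dict
```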

Query to Retrieve All Existing Tensile Values
The document in MongoDB is queried by setting filter values to "exists" for all three tensile property values, and an "OR" condition is applied among the tensile values. The projection is to include the tensile column values in the result set. In the projection, if we choose the option as include, then the column value is set to one in the query code. We used the Mongo Visual Query Builder to build the query and to set the filters and projections for retrieving the tensile properties. Pseudocode 1 shows the query code for retrieving all existing tensile values from the database.
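A hedged sketch of what such a filter and projection could look like as PyMongo-style dictionaries; the dotted field paths are assumptions about the schema, not the authors' exact query:

```python
# Assumed field paths inside the embedded "Metal Properties" documents.
YS = "Metal Properties.Tensile Yield Strength (YS)"
UTS = "Metal Properties.Ultimate Tensile Strength (UTS)"
EL = "Metal Properties.Elongation-at-fracture (EL)"

# Filter: any of the three tensile properties exists ("OR" condition).
filter_existing = {"$or": [
    {YS: {"$exists": True}},
    {UTS: {"$exists": True}},
    {EL: {"$exists": True}},
]}

# Projection: include the tensile columns (each value set to one).
projection = {YS: 1, UTS: 1, EL: 1}
print(filter_existing, projection)
```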
Query to Retrieve All Numerical Tensile Values
The document in MongoDB is queried by setting filter values that do not equal NAN for all three tensile property values, and an "AND" condition is applied among the tensile values. The projection is to include the tensile column values in the result set. In the projection, if we choose the option as include, then the column value is set to one; otherwise, if we choose to exclude the column, the value is set to zero in the query code. Pseudocode 2 shows the query code to retrieve all numerical tensile values.
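The NAN-excluding variant could be expressed with the same dictionary style; again, the field paths are schema assumptions:

```python
# Assumed field paths inside the embedded "Metal Properties" documents.
YS = "Metal Properties.Tensile Yield Strength (YS)"
UTS = "Metal Properties.Ultimate Tensile Strength (UTS)"
EL = "Metal Properties.Elongation-at-fracture (EL)"

# Filter: all three tensile properties must not equal NaN ("AND" condition).
# (In MongoDB's BSON comparison, NaN compares equal to NaN, so $ne NaN
# keeps only numerical values.)
filter_numeric = {"$and": [
    {YS: {"$ne": float("nan")}},
    {UTS: {"$ne": float("nan")}},
    {EL: {"$ne": float("nan")}},
]}

# Projection: include the tensile columns (one), exclude the Id (zero).
projection = {YS: 1, UTS: 1, EL: 1, "_id": 0}
print(filter_numeric, projection)
```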

Pseudocode Execution Flow
Figure 3 shows the steps of execution of the query codes in Pseudocode 1 and Pseudocode 2, and we can see that a collection scan is applied on the Database.Collection Name. The specified filters and projections are applied in the query code of Pseudocode 1 and Pseudocode 2 to retrieve the tensile values, and the result is displayed.

Convert JSON to CSV File Format and Prepare the Dataset to Feed the Machine Learning Model
The document is queried using the query code shown in Pseudocode 1 to retrieve all existing tensile values, which returns 143 documents. This result set is in the form of a JSON object. The JSON object is parsed and written to a CSV file using Python scripting, which forms the input to the ML algorithm to perform the classification of the Mg-alloy. When the JSON-to-CSV conversion is done, we observe 217 rows of tensile values, which matches the count of the original dataset. Figure 4 shows a sample of 20 rows of retrieved results. The query code of Pseudocode 2 is applied to exclude tensile properties with NAN values, and the result has 108 documents with continuous numerical values for all three tensile properties. These 108 retrieved documents are in JSON format and are parsed to CSV to form 161 rows of data in CSV format. Figure 5 shows a sample of 20 rows of retrieved results.
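The flattening from nested documents back to CSV rows can be sketched as follows; the field names and sample documents are illustrative assumptions:

```python
import csv
import io

TENSILE_COLS = ["YS (MPa)", "UTS (MPa)", "EL (%)"]  # assumed field names

def documents_to_rows(docs):
    """Flatten each embedded "Metal Properties" element back into one CSV row,
    so the document count expands to the original per-alloy row count."""
    for doc in docs:
        for props in doc.get("Metal Properties", []):
            yield [props.get(col) for col in TENSILE_COLS]

docs = [  # stand-in for the JSON result set returned by the query
    {"Metal Properties": [{"YS (MPa)": 190, "UTS (MPa)": 270, "EL (%)": 14},
                          {"YS (MPa)": 205, "UTS (MPa)": 290, "EL (%)": 9.5}]},
    {"Metal Properties": [{"YS (MPa)": 150, "UTS (MPa)": 230, "EL (%)": 21}]},
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(TENSILE_COLS)
writer.writerows(documents_to_rows(docs))
print(buffer.getvalue())  # two documents expand to three data rows
```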

Convert JSON to CSV File Format and Prepare the Dataset to Feed the Machine Learning Model
The collection is queried using the query code shown in Pseudocode 1 to retrieve all existing tensile values, returning 143 documents. The result set is a JSON object, which is parsed and written to a CSV file using a Python script; this CSV file forms the input to the ML algorithms used to classify the Mg-alloy data. After the JSON-to-CSV conversion, we observed 217 rows of tensile values, which matches the count of the original dataset. Figure 4 shows a sample of 20 rows of the retrieved results. The query code of Pseudocode 2 is then applied to exclude tensile properties with NaN values; the result contains 108 documents with continuous numerical values for all three tensile properties. These 108 documents are likewise parsed from JSON to CSV, forming 161 rows of data. Figure 5 shows a sample of 20 rows of the retrieved results.
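The JSON-to-CSV parsing step can be sketched with Python's standard library. This is a minimal illustration, not the study's script; the field names and sample values are assumptions.

```python
import csv
import io
import json

def json_to_csv(json_text, fieldnames):
    """Parse a JSON array of documents and write the named fields to CSV text."""
    documents = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for doc in documents:
        # Missing fields become empty cells rather than raising an error.
        writer.writerow({name: doc.get(name, "") for name in fieldnames})
    return out.getvalue()

# Illustrative documents carrying the three tensile properties.
sample = json.dumps([
    {"YS": 180, "UTS": 250, "EL": 12.5},
    {"YS": 150, "UTS": 230, "EL": 18.0},
])
csv_text = json_to_csv(sample, ["YS", "UTS", "EL"])
```

The resulting CSV text (header plus one row per document) can be written to disk and fed directly to the classification step.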

Data Classification
The logistic regression algorithm is a popular binary classification algorithm [14]. In this model, the probabilities describing the possible outcomes are modeled using a logistic function, also known as the sigmoid function. This function maps the estimated probability of an outcome toward 0 or 1, which determines the class of the output variable. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1. Usually, 0 is predicted as the negative class and 1 as the positive class.
Below is the logistic equation. Input values (x) are combined linearly using weights or coefficient values to predict an output value (y). The output value being modeled is a binary value (0 or 1) rather than a numeric value, and it denotes the probability of the class:

y = 1 / (1 + e^-(B0 + B1*x))

where B0 is the bias or intercept term and B1 is the coefficient of the single input value (x). In the Mg-alloy data, we chose the tensile properties YS and UTS as the independent input variables to the algorithm to predict the dependent output variable, ductility. We applied a logistic regression model to classify the Mg-alloy data: a classification result of one denotes a ductile metal and zero a brittle metal. The classification accuracies are calculated, and a spot-check algorithm comparison is performed for all three classification models, namely Logistic Regression, Support Vector Machine (SVM), and Random Forest.
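The sigmoid mapping and the 0/1 thresholding described above can be sketched in plain Python. The coefficients here are illustrative placeholders, not fitted values from the study.

```python
import math

def sigmoid(z):
    """Map any real-valued number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, b0, b1, threshold=0.5):
    """Logistic prediction: linear combination, sigmoid, then threshold to 0/1."""
    probability = sigmoid(b0 + b1 * x)
    return 1 if probability >= threshold else 0

# At z = 0 the sigmoid returns exactly 0.5, the usual decision boundary.
p_mid = sigmoid(0.0)
```

A large positive linear combination is classified as the positive (ductile) class and a large negative one as the negative (brittle) class.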

Results and Discussion of Data Classification
The dataset contains the mechanical tensile properties YS, UTS, and EL. We introduced a dummy categorical variable called ductility based on the value of EL: ductility is one if the EL value is greater than 15 and zero otherwise. We chose tensile YS and UTS as the independent input variables and ductility as the output variable. The dataset is then divided into train and test samples: of the 161 data rows, 80 percent (128 rows) form the train dataset and 20 percent (33 rows) form the test dataset. The independent input variables in the train dataset are normalized using the Standard Scaler function before fitting the model. Ten-fold cross validation applied on the 128 rows yielded accuracies of 70.25%, 75%, and 71.92% for Logistic Regression, SVM, and Random Forest, respectively. Figure 6 shows the spot-check algorithm comparison of all three models. Predictions are made by fitting the transformed input and output values of the train dataset (128 data rows) and testing on the independent dataset of 33 rows using the Logistic Regression and SVM models, with observed accuracies of 72.72% and 78.78%, respectively. For SVM, we experimented with a polynomial kernel transformation of degree = 3 and observed an accuracy of 0.75, whereas with degree = 8 the accuracy is 0.78. The radial basis function (RBF) kernel gave a similar accuracy of 0.78 with regularization parameter C = 1.0. Based on this observation, we can say that the polynomial kernel with degree = 3 underperformed compared to RBF because the characteristics of the data distribution fall in the radial basis curve. Despite its lower accuracy, the nonlinear polynomial-kernel SVM model with degree = 3 is preferred over degree = 8, as a model with a higher polynomial degree is more difficult to generalize.
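The 80/20 split and standardization described above can be sketched in plain Python (the Standard Scaler behavior is reproduced manually here; a real pipeline would typically use scikit-learn's `StandardScaler` and `train_test_split`, and would shuffle before splitting).

```python
import math

def standardize(column):
    """Scale one feature column to zero mean and unit variance (StandardScaler behavior)."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

def train_test_split(rows, train_fraction=0.8):
    """Deterministic 80/20 split; a real pipeline would shuffle first."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(161))            # stand-in for the 161 CSV data rows
train, test = train_test_split(rows)
```

With 161 rows, this split reproduces the 128-row train set and 33-row test set used in the study.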
With these results, we can determine the best model for the Mg-alloy dataset, but before making that decision, we also calculated an important metric, the Receiver Operating Characteristic (ROC) curve and its Area Under the Curve (AUC-ROC). ROC curves are very helpful for understanding the balance between true-positive and false-positive rates. To compute the ROC curve, probabilities are predicted for the independent variables of the test dataset (33 rows), and only the positive-class probability is considered when calculating the ROC-AUC scores. The true-positive and false-positive rates are calculated and used to plot the ROC curve. The AUC for the ROC is calculated for all three models and tabulated in Table 5, and the ROC curve plots are shown in Figure 7. ROC-AUC values range from 0 to 1; models with a ROC-AUC score close to 1 perform better than the others.
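The ROC construction can be sketched in plain Python: sort by predicted positive-class probability, sweep thresholds, accumulate true-positive and false-positive rates, and integrate with the trapezoidal rule. This is a minimal illustration on toy data, not the study's implementation (which would typically use scikit-learn's `roc_curve` and `auc`).

```python
def roc_auc(labels, scores):
    """Return (fpr_list, tpr_list, auc) from binary labels and positive-class scores."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    fprs, tprs = [0.0], [0.0]
    for _, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        tprs.append(tp / pos)
        fprs.append(fp / neg)
    # Trapezoidal integration of TPR over FPR gives the area under the curve.
    auc = sum((fprs[i] - fprs[i - 1]) * (tprs[i] + tprs[i - 1]) / 2
              for i in range(1, len(fprs)))
    return fprs, tprs, auc

# A perfectly separated toy example gives AUC = 1.0.
_, _, auc = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

A model that ranks every positive above every negative reaches AUC = 1.0; reversed ranking gives 0.0, and chance-level ranking gives 0.5.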
From Table 5, we observe that SVM-Model-2 and Random Forest outperformed the other two models. The AUC-ROC value for SVM-Model-2 is as high as that of Random Forest, but we do not prefer SVM-Model-2 because it is built with a polynomial kernel of degree = 8, which makes the model difficult to generalize. Instead, the SVM model with a polynomial kernel of degree = 3 is preferred, achieving an AUC-ROC value of 0.818. Hence, we consider Random Forest and SVM with a polynomial kernel of degree = 3 as the best classification models for the Mg-alloy dataset.
In addition to the classification models, we applied regression analysis to the Mg-alloy dataset using multivariate polynomial regression, because we have two varying independent variables. We chose YS and UTS as the independent variables to predict the values of the dependent variable EL. The independent variables were transformed using a polynomial function of degree = 3, and the transformed values of YS and UTS, together with the original EL values, were then split into train and test datasets: of the 161 rows, 80% form the train data (128 rows) and the remaining 20% (33 rows) the test data. The train data is fit to the polynomial regression model to predict the EL values. The regression model is evaluated by calculating the root mean square error (RMSE) and the coefficient of determination (R-squared, r^2), which measure how far the predicted values deviate from the actual values. We observed an RMSE of 5.57 and an R-squared value of 0.60. The R-squared value of an ideal regression model equals 1, but here the observed score is 0.60. This regression score is consistent with the limited classification performance (less than 80% cross-validation accuracy).
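The two evaluation metrics can be sketched directly from their definitions in plain Python; the toy values below are illustrative, not the study's data.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [10.0, 12.0, 15.0, 18.0]
perfect = r_squared(actual, actual)            # a perfect fit gives r^2 = 1
error = rmse(actual, [11.0, 11.0, 16.0, 17.0])  # each prediction off by 1
```

An r^2 of 0.60, as observed here, means the fitted model explains about 60% of the variance in EL, leaving clear room for improvement.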
We believe that the ductility model may further depend on the metal compositions and processing parameters/methods, which could yield a higher correlation coefficient and higher classification accuracies. Alternatively, we anticipate further improving the polynomial regression model using data transformation techniques.

Conclusions
In this study, we developed the first big data storage for Mg-alloy data, which is essential for developing an intelligent database, automating machine learning models for the determination of strong Mg-alloy materials, and further performing material science analytics. We found that the MongoDB big data tool outperformed alternatives for storage of the semi-structured Mg-alloy big data: the big data mining operations of insertion and retrieval consumed minimal time in the MongoDB database. We also experimented with Apache Hive for Mg-alloy data storage and observed that Hive did not perform as well as MongoDB because of the nature of the connectivity of the data elements in the Mg-alloy database [50]. The classification results of the ductility model show that we achieved accuracies of up to 75% among the three machine learning classification models, and the regression results likewise indicated a limited R-squared value of 0.60. There is scope to improve the classification accuracies and regression results by applying data transformation techniques.
This study can be extended by (i) expanding the input variables to include the metal composition values, processing methods/parameters, and tensile properties to determine the ductility of the metal. However, these variables form a highly sparse dataset, depending on the choice of metal compositions in different iterations. We would like to address this sparse-dataset problem by exploring sub-sampling methods and singular-value decomposition (SVD) techniques. In future work, (ii) we would also like to develop an inverse model that predicts the metal composition, processing methods, and their parameters by considering the tensile properties and ductility of the metal as independent input variables.

Figure 1
Figure 1 is the model diagram representing the flow of the processes involving data management, big data mining and data classification.

Figure 1 .
Figure 1. Model Diagram for the classification of Material Science data.

Figure 2 .
Figure 2. Data Model Design for Material Science data.

Figure 3 .
Figure 3 shows the steps of execution of the query codes in Pseudocode 1 and Pseudocode 2, and we can see that the collection scan is applied on the Database.CollectionName. The specified filters and projections are applied in the query codes of Pseudocode 1 and Pseudocode 2 to retrieve the tensile values, and the result is displayed.

Figure 4 .
Figure 4. Tensile properties retrieved from database using query from Pseudocode 1.

Figure 5 .
Figure 5. Numerical Tensile properties retrieved from database using query from Pseudocode 2.

Figure 6 .
Figure 6. Spot Check Algorithm comparison for Logistic Regression, SVM and Random Forest models.

Figure 7 .
Figure 7. ROC curves for Logistic Regression, SVM and Random Forest models.

Table 1 .
Execution times for the data model design, conversion, and transfer functionality.

Table 2 .
Execution times for data retrieval from database and parsing the resultant JSON file to CSV file format.
Pseudocode 1. Query to retrieve all tensile details whose tensile values ≥ 0 or are not NaNs.
Pseudocode 2. Query to retrieve all numerical tensile values.

Table 4 .
Performance Metrics of the Mg-alloy dataset on the independent test data (33 rows) using the parameters from Table 1.

Table 5 .
ROC-AUC values for Logistic regression, SVM and Random Forest models.