Credibility Analysis of User-Designed Content Using Machine Learning Techniques

: Content is a user-designed form of information, for example, an observation, perception, or review. This type of information is more relevant to users, as they can relate it to their own experience. The research problem is to identify the credibility of such content, and the percentage of credibility as well. Assessment of such content is important to convey the right understanding of the information. Different techniques are used for content analysis, such as voting on the content, machine learning techniques, and manual assessment, to evaluate the content and the quality of information. In this research article, content analysis is performed by collecting the Movie Review dataset from Kaggle. Features are extracted and the most relevant features are shortlisted for experimentation. The effect of these features is analyzed using base regression algorithms, such as Linear Regression, Lasso Regression, Ridge Regression, and Decision Tree. The contribution of the research is the design of a heterogeneous ensemble regression algorithm for content credibility score assessment, which combines the above baseline methods. Moreover, these factors are also attenuated to obtain values closer to the gradient-descent minimum. Different forms of error loss, such as Mean Absolute Error, Mean Squared Error, LogCosh, Huber, and Jacobian, are calculated, and the performance is optimized by introducing a balancing bias. The accuracy of the algorithm is compared with individual regression algorithms and ensemble regression separately; this accuracy is 96.29%.


Introduction
User-generated content is increasing rapidly because of review websites, Online Social Networking (OSN) sites, blogs, e-commerce websites, and discussion forums. Engagement on these websites is comparatively higher than on any other platform because of the usefulness of the content, and a huge amount of content is created. People read the content, tend to believe it, and react accordingly, so the exactness of the information is necessary in order not to mislead people. Credibility is decided by classifying information into credible and non-credible information [1,2]. The volume of content is increasing continuously, and the pace is likely to increase further in the future. Users create content from their own perception; they are not experts in the field, so errors can be expected in the content. Though the content is useful and practical, there is a necessity for an approach to avoid misuse of the content [3][4][5].
The commercialization of this issue has already started [6]. Companies promote content creators to advertise their products with positive reviews, and individual users also obtain rewards for writing reviews with product specifications. Users may misinterpret reality because of such content [7,8]. This can be addressed by using an evaluation system to identify the information in the content in its correct form. If the credibility of information is not checked, it may lead to an inorganic spread of information. Examples of this form of content are spam, unethical marketing, clickbait, fake news, and enticing content with irrelevant information [9,10].
A user defines content through his or her perception and understanding, so it may contain varying percentages of correct information. Extraction of the correct information from a story, post, blog, tweet, or microblog is a challenging task [11]. The various threats and forms of misuse of data can be seen in Figure 1: misleading, incomplete, unreliable, conflicting, and invalid information. All these types of information deceive users and may lead them to incorrect opinions, thinking, or decisions. Several combinations of these types (A, B, C, D, and E) are possible, as stated in Figure 1 [11,12]. Section 1 covers the introduction to content generated especially on social networking forums and blogs; it discusses organic content and the mechanisms to create, propagate, and validate it, and suggests possible ways to address these issues. Section 2 covers the materials and methods used to deal with credibility, briefly discussing the experimentation performed, the importance of feature analysis and regression algorithms, and how performance can be optimized using minima and load balancing. Section 3 elaborates on the generic representation of the research work using algorithms and mathematical expressions, covering the important phases; the uniqueness of the algorithm relies on a combination of boosting and stacking, with performance further improved using load balancing and the search for minima. Section 4 covers the validation techniques applicable to regression algorithms, discussing various losses, such as Mean Squared Error, Mean Absolute Error, and Huber loss; these error-loss techniques highlight the gap between actual and expected outcomes.
Section 5 covers the results of the algorithm and their validation using the error-loss techniques discussed in Section 4; the Mean Squared Error, Mean Absolute Error, and Huber loss are plotted separately. The result of the FERMO approach is satisfactory, as the minimal losses indicate the high efficacy of the algorithm. Section 6 discusses the result achieved by the algorithm and the improvement in the result after optimization; a credibility score is used to understand the quality of the content, rather than just the 1-and-0 classification approach. Section 7 presents the conclusion of the research work: the accuracy is improved by using a combination of stacking and boosting mechanisms together, followed by optimization and load balancing. The References section lists the works used in this research in the fields of content credibility, organic content, machine learning algorithms, user perception, and quality assessment of content for propagation.

Literature Survey
This section covers research in the fields of information credibility, fake-news detection, and trust in social media content; the majority of the research articles used are from 2019 to 2021. P. K. Verma et al. [1] focused on the WELFake approach, based on the classification of fake news using a voting method; 20 shortlisted features are used, and a weighted classification of the selected features delivers high accuracy. Kaushal et al. [2] describe clickbait as content designed to attract readers; the study suggests a relation between the clickbait mechanism and reader correlation. M. Faraon [13] assigns a score to an article to understand its trustworthiness; the author believes that the credibility of an article cannot simply be calculated by a machine and that human interference is necessary to decide the credibility of the content. R. Kumar et al. [3] designed the FakeBERT model to classify fake news using a CNN, where the features are detected automatically. Erwin B. et al. [13] used 10,000 samples from Twitter and Facebook (56 accounts with 23,489 messages) to perform classification using Naive Bayes (NB), Support Vector Machine (SVM), Logistic Regression (Logit), and the J48 algorithm (J48). Seventeen new features for Twitter and 49 for Facebook are used, and spam and sentiment are also considered. The output of the experiment is the message and user credibility, and the results are compared with the accuracy of earlier classifiers. In Andrew J. et al. [9], the sample is selected to represent the entire USA population, using the professional research firm GfK's probability-based participants. The questionnaire is prepared with a logic addressing a multi-step, multi-method process including feedback and direct interviews. The authors compared context with all sides of the content using a Likert scale; the features are message sidedness, flexible thinking (FT), need for cognition, and interaction.
Depending upon whether the interaction is two-way or three-way, the balance or inclination towards a specific opinion is decided; ANOVA is used for validation. Melissa T. [10] first surveyed 1207 participants and then 603 participants. The analysis suggests that Twitter accounts are available for 60% of the 603 participants, who are more educated. According to the author, credibility depends upon the tweet, media literacy, the platform, and misinterpretations. The accuracy of the model is decided using the Chi-Square testing method; additionally, recall helps in deciding the coverage of the total information. F. Liu et al. [11] address one of the important issues, effective crisis management through narrators; to accomplish this, a sample of adults in the U.S. is asked to complete a survey covering seven different points. Crisis information-seeking intentions, emotions, government responsibility attribution, and perceived information credibility are the parameters for studying the credibility of the narrator [14]. The literature survey led to important observations about the sources and the types of datasets useful for the experiment. Different linguistic, platform-specific, and user-specific features are analyzed. Approaches to deciding credibility based on surveys, machine learning, or a combination of both were discussed, along with details of the survey scales and the techniques used to perform the surveys. The machine learning techniques and the choice of specific algorithms depended upon whether classification or regression was used. The potential threat of not addressing content modeling is noted.

Materials and Methods
The experiment was performed to implement the regression technique and to devise a method to predict the credibility score for a document.

Data Set Collection
The data set consisted of 17 million records of a Movie Review data set from Rotten Tomatoes, available at Kaggle. The dataset contained reviews of movies with ratings and information about freshness [8]. This data set was selected because it pairs a description of the movie (information) with a review (content) [15,16].

Feature Analysis
Preparation of the data set was carried out by checking for missing elements, converting the data set into a structured format for processing, and converting the "fresh" and "rotten" keywords to "1" and "0", respectively, so that they could be treated as a Boolean variable [3,17]. The data set contained a total of eight features, including Link, Review, Top_Critique, Publisher_Nme, Review_Date, Review_Score, and Critique_Name [18]. Analysis of these features was necessary, covering the data type of each feature (data format) and the direct or derived form of the values associated with it. Prima facie, the type of variable and the relevance of each parameter value, in direct or derived form, were analyzed using the IBM SPSS Statistics tool [8,19] (Figure 1). The detailed information about the dataset processing is given below. The analysis of missing records was performed using IBM SPSS: Table 1 specifies the missing values, which are none in this case. Similarly, the inclusion and exclusion analysis of the review score can be seen in Table 2.
Feature analysis concerning data type and dependent and independent variables is discussed in Table 3. The review score is the most relevant target variable; along with it, the top critique and review type also reveal important content. Table 4 represents the fundamental operations applied to the numerical features (non-numerical features were also converted to numerical values). These operations help in identifying outliers and removing them.
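As a sketch of the preprocessing steps above (missing-value checks and the "fresh"/"rotten" to 1/0 conversion), the following pandas snippet illustrates the idea on a toy frame; the column names and sample rows here are hypothetical, not the actual Kaggle schema.

```python
import pandas as pd

# Hypothetical miniature of the Movie Review frame; real column names may differ.
df = pd.DataFrame({
    "Review": ["Great pacing and acting.", "Dull and overlong.", None],
    "Review_Type": ["fresh", "rotten", "fresh"],
    "Review_Score": [8.5, 3.0, 7.0],
})

# 1. Check for missing elements (the paper reports none via IBM SPSS).
missing_counts = df.isna().sum()

# 2. Convert "fresh"/"rotten" to Boolean 1/0 for further processing.
df["Review_Type"] = df["Review_Type"].map({"fresh": 1, "rotten": 0})

# 3. Drop rows whose review text is missing.
df = df.dropna(subset=["Review"])

print(missing_counts["Review"])        # 1 missing review in this toy frame
print(df["Review_Type"].tolist())      # [1, 0]
```

The same three steps scale to the full dataset; only the column names change.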

Regression Algorithm
Regression is necessary to predict the credibility of a given document based on the features and operations applied [6]. The regression algorithms are first applied in their original form, and then processing that uses the best of these algorithms is introduced. Random Forest, Lasso Regression, Linear Regression, Polynomial Regression, Support Vector Machine, Decision Tree, and XGBoost are the algorithms used to perform regression on the data [20][21][22]. A combination of these algorithms is used to improve accuracy [5,7,17]. The existing techniques are strengthened using stacking and then boosting. This mechanism helps in predicting the qualitative measures of the content accurately.

Ensemble Regression
Regression can be further improved with bagging- and boosting-like approaches. The ensemble approach used here is a combination of a conventional ensemble and hybrid models [4,12]. In a normal ensemble, the best-performing algorithms are combined; here, the ensemble contributors are used again to train the system with their best features to deliver better results (this is different from plain stacking) [23].

Error Loss Calculation
Loss is associated with the difference between the expected outcome and the actual outcome. Regression cannot be assessed with the measures used for classification algorithms, such as precision, recall, F-measure, and MAP [5]. To measure performance, the algorithms were compared through their respective losses [24]. Error loss was calculated using Mean Squared Error, Mean Absolute Error, and Huber loss. Accuracy was further improved by applying optimization [8].

Optimization
Regression was enhanced using the ensemble algorithm and improved again with a blend of the ensemble's contributors. The loss was minimized by adding a constant for balancing and regularization [25]. The optimization technique is another unique feature of this model: it strives toward the minimum to reach an equilibrium point near it [26,27]. This was made possible by balancing the weights in proportion to all contributors of the regression algorithm [4,6].
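The proportional weight balancing can be sketched as follows. This is a hypothetical illustration, assuming each contributor's share is proportional to its validation R² score; the scores and the normalization rule are assumptions, not the paper's exact procedure.

```python
import numpy as np

# Illustrative validation R^2 scores for four contributors
# (e.g. Linear, Lasso, Ridge, Decision Tree -- values are made up).
r2_scores = np.array([0.81, 0.78, 0.79, 0.86])

# Balance the load: each contributor's weight is proportional to its
# score, normalized so the weights sum to 1.
weights = r2_scores / r2_scores.sum()

print(np.round(weights, 3))  # the strongest contributor gets the largest share
```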

Credibility Score
Credibility score prediction: the score was calculated, and credibility was identified with its help. The credibility score is the outcome used to measure the quality of user-generated content: the higher the score, the better the quality of the content.

Methodology
The methodology for selecting feature variables and target variables, and for dataset normalization and cleansing, is discussed in detail. The use of individual base models for regression, and the combination of base models into an ensemble model to obtain higher accuracy, are generalized below.

Problem Formulation
Let x = [x_1, x_2, x_3, ..., x_n] be the feature vector, where n is the total number of features, and let y be the target variable. Let D = (X, Y) be the dataset, where X = [x_{i,j}], with i the sample index and j the feature index, i ∈ [1, m] and j ∈ [1, n]; m is the number of samples and n is the number of features. Here, x_{i,j} is the value of the j-th feature in the i-th sample. Y = [y_i], with i ∈ [1, m], where y_i is the value of the target variable y for the i-th sample.
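The formulation above can be instantiated concretely; a toy sketch with m = 4 samples and n = 3 features (the random values are placeholders):

```python
import numpy as np

# Toy instantiation of the problem formulation: m samples, n features.
m, n = 4, 3
rng = np.random.default_rng(0)

X = rng.random((m, n))  # X[i, j]: value of the j-th feature for the i-th sample
y = rng.random(m)       # y[i]:    target (credibility score) for the i-th sample
D = (X, y)              # the dataset D = (X, Y)

print(X.shape, y.shape)
```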

Explanation about Stacking in the Ensemble
The accuracy of the model can be increased using bagging and boosting if the algorithms are homogeneous. For heterogeneous algorithms, stacking is used. To further improve accuracy, the cost function should be designed to minimize the losses.
In the experimentation, stacking is used to accommodate heterogeneous algorithms. There are two levels, L-0 and L-1 [28,29], dealing with individual regression and ensemble regression, respectively.
Level 0 (L-0): The base algorithms Linear Regression, Lasso Regression, Ridge Regression, and Decision Tree [30,31] are used to train the model to calculate the credibility score. The base models at L-0 are trained for a number of samples n and a number of models k. The accuracy of each individual algorithm is stated in Table 5. These accuracies are further improved by the next stage, Level 1 [32,33]. Level 1 (L-1): The model is trained further using a meta-learning regression approach. The objective function is defined with the intent of minimizing the error by adding a balancing factor [34,35]. The weights are identified using backpropagation-style stochastic gradient descent.
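The L-0 / L-1 structure can be sketched with scikit-learn's StackingRegressor on synthetic data. This illustrates plain stacking only, not the paper's full FERMO model with balancing and optimization; the data and hyperparameters are assumptions for the sketch.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression task standing in for the credibility-score data.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)

# Level 0: the four heterogeneous base regressors named in the text.
level_0 = [
    ("linear", LinearRegression()),
    ("lasso", Lasso(alpha=0.01)),
    ("ridge", Ridge(alpha=1.0)),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=42)),
]

# Level 1: a meta-learner trained on the base models' predictions.
model = StackingRegressor(estimators=level_0, final_estimator=LinearRegression())
model.fit(X, y)
r2 = model.score(X, y)
print(round(r2, 3))
```

The meta-learner's R² on this easy synthetic task is close to 1; on real review data the gap between L-0 and L-1 is what the paper's Tables 5 and 6 report.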

Algorithm
The algorithms used for regression deliver the accuracy and the weight required for each algorithm. The model is then balanced to avoid overtraining: balancing maintains the equilibrium needed for gradient descent. The efficiency of the algorithm is extended by minimizing the value of the curve, driving gradient descent to its lowest value (Algorithm 1). The flow of the data can be interpreted with the help of the diagram in Figure 2. The unique contribution of the work lies in the weight calculation, error minimization, and optimization during the regression process.

Regression Loss
When a regression problem is converted into a classification problem, classification accuracy can be verified with precision, recall, F-measure, MAP, etc. For regression, no such direct assessment is available [36]. Instead, the error loss is calculated to quantify the exactness of the prediction (summarized by the R² score). The losses used in this experiment include Mean Squared Error, Mean Absolute Error, Mean Absolute Logarithmic Error, Mean Absolute Percentage Error, Huber loss, LogCosh, and Quantile loss.

Mean Squared Error (MSE) L2 Loss, Quadratic Loss
MSE is the sum of squared distances between the predicted variable and the target variable. The graph of MSE loss versus prediction is "U"-shaped. By contrast, MAE is robust to outliers, but its learning is not smooth because the loss has no curvature. The mean is more susceptible to outliers than the median. Use MSE when losing data is costlier than the presence of an outlier.
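The outlier sensitivity described above can be seen numerically: the same residuals with one outlier inflate MSE far more than MAE.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.1, 3.1, 14.0])  # last prediction is an outlier

# MSE squares the residuals, so the single outlier dominates it
# (about 25 here, versus an MAE of about 2.6 on the same residuals).
mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
print(round(mse, 2), round(mae, 3))
```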

Objective Function and Minimization
Let ŷ_{i,j} be the predicted output of the j-th model for the i-th sample.
Step 2: Error Calculation. The objective function is defined as
J(θ) = Σ_{i=1}^{m} ( y_i − Σ_{j=1}^{k} θ_j ŷ_{i,j} )² + μ Σ_{j=1}^{k} θ_j²
where θ = [θ_1, θ_2, θ_3, ..., θ_k], θ_j is the weight associated with model M_j, and μ is a regularization parameter used to control the values of θ selected by the optimization algorithm.
Step 3: Minimization. The optimization problem is to minimize J(θ) subject to θ_j ≥ 0. Solving it yields the values of θ_j, where ŷ_{i,j} is the predicted value for the i-th sample x_i by the j-th base model M_j. Figure 2 indicates the flow of the data from the input collected from the dataset to the final outcome.
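The minimization of the regularized objective can be sketched with plain gradient descent on a least-squares objective of the form described above; the learning rate, iteration count, and toy data below are assumptions for illustration, standing in for the paper's backpropagation-style SGD.

```python
import numpy as np

# Minimize J(theta) = sum_i (y_i - sum_j theta_j * yhat[i, j])**2
#                     + mu * sum_j theta_j**2
# where yhat[i, j] is base model j's prediction for sample i (toy values here).
rng = np.random.default_rng(1)
m, k = 50, 4
theta_true = np.array([0.4, 0.1, 0.3, 0.2])  # hidden "ideal" model weights
yhat = rng.random((m, k))
y = yhat @ theta_true

theta = np.zeros(k)
mu, lr = 1e-3, 1e-3
for _ in range(2000):
    # Gradient of the squared-error term plus the L2 regularization term.
    grad = -2 * yhat.T @ (y - yhat @ theta) + 2 * mu * theta
    theta -= lr * grad

print(np.round(theta, 2))  # recovers weights close to theta_true
```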


Mean Absolute Error (MAE)
MAE is the summation of the absolute differences between the predicted and the target variable [2]. If the direction is considered, it becomes the Mean Bias Error (the sum of residuals/errors). The graph of MAE loss versus prediction is "V"-shaped. The magnitude of the gradient remains the same throughout, which makes learning difficult: even a small error produces the same gradient magnitude, degrading the learning. Use MAE when outliers cannot be tolerated [9].

Huber Loss Function, Smooth Mean Absolute Error
For small errors, the Huber loss is quadratic; the parameter delta determines the threshold below which an error counts as small. Tuning the delta parameter across iterations is a problem. Huber loss combines the robustness of MAE with the well-defined minimum of MSE [15]. One drawback is that the minimum can be missed because of the flat region of the curve.
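A sketch of the Huber loss under its standard parameterization, with delta as the quadratic/linear threshold:

```python
import numpy as np

def huber(residual, delta=1.0):
    """Quadratic for |r| <= delta (MSE-like), linear beyond it (MAE-like)."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

# A small residual is penalized quadratically, a large one only linearly.
print(huber(np.array([0.5, 3.0]), delta=1.0))  # [0.125, 2.5]
```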

LogCosh Loss
LogCosh is similar to the squared error but is less affected by occasional wrong predictions. It is twice differentiable, though it may still be affected by large outliers [13].
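LogCosh can be sketched as follows: for small residuals it behaves like r²/2 (MSE-like), and for large ones like |r| − log 2 (MAE-like), which is why it is smooth yet comparatively robust.

```python
import numpy as np

def log_cosh(y_true, y_pred):
    """Mean of log(cosh(r)): ~r^2/2 for small r, ~|r| - log(2) for large r."""
    r = y_pred - y_true
    return np.mean(np.log(np.cosh(r)))

small = log_cosh(np.zeros(1), np.array([0.1]))   # close to 0.1**2 / 2 = 0.005
large = log_cosh(np.zeros(1), np.array([10.0]))  # close to 10 - log(2)
print(small, large)
```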

Quantile Loss
The quantile (pinball) loss penalizes the difference between the actual and predicted values asymmetrically, by a constant factor on each side [37][38][39]. Instead of an exact point prediction, interval prediction gives more useful accuracy, and the resulting non-linear model is more practical. Quantile loss is an extension of MAE: at the 0.5 quantile it reduces to half of MAE [17].
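The pinball loss and its relation to MAE at q = 0.5 can be sketched as:

```python
import numpy as np

def quantile_loss(y_true, y_pred, q):
    """Pinball loss: penalizes under- and over-prediction asymmetrically."""
    r = y_true - y_pred
    return np.mean(np.maximum(q * r, (q - 1) * r))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 3.5])

# At q = 0.5 the pinball loss is exactly half of the MAE.
mae = np.mean(np.abs(y_true - y_pred))
assert np.isclose(quantile_loss(y_true, y_pred, 0.5), 0.5 * mae)

# At q = 0.9, under-prediction (positive residual) is penalized more heavily.
print(quantile_loss(y_true, y_pred, 0.9))
```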

Results
The result section covers the accuracy and the error loss between actual and expected output. A minimum adjoining a sharp curve represents a favorable outcome. These errors can be minimized; the details of the minima and the enhancement are addressed in the Optimization section. The results of the experimentation are represented in nine different graphs, presented in Figures 3-9. Result analysis is carried out from two perspectives, or levels. At level zero (L-0) there are the individual algorithms: Decision Tree, Ridge, Lasso, and Linear Regression; details are given in Tables 5 and 6.
Similarly, several ensemble techniques are used to calculate the accuracy of the regression algorithm. The prediction accuracy is higher at L-1 than at L-0; the accuracy of FERMO is 96.29%. To understand the error quotient in the prediction, the loss is also calculated. The details of the loss analysis during the ensemble regression are given below.


Mean Squared Error (MSE)
MSE is the square of the difference between the actual and expected outcome. Analysis of this error helps in minimizing the effect of the overall loss. Figure 5 indicates the regression loss calculated using MSE; ideally, it represents a U shape. FERMO is resilient to the loss of data: the flatter parts of the curve are generally away from the minimum [40,41].
Figure 3 represents the MSE plot for the full movie dataset instead of the Movie Review subset. The sharper curvature can be easily noticed, indicating the influence of a large volume of data on the curvature.

Mean Absolute Error (MAE)
Mean Absolute Error (MAE) is the average of the absolute differences between the expected and actual outcomes. It is robust to outliers; only the magnitude is covered, with no coverage of direction [42].

Huber Loss Function
The Huber loss function is less sensitive to outliers, so there is no sharp angle at the minimum; it is also called a smooth function. It combines the robustness of MAE and MSE: for smaller values of delta, the Huber function resembles MAE, and for higher values it resembles MSE [16].

Mean Square Error Loss & Mean Absolute Error Loss
Plotting the Mean Square Error loss and the Mean Absolute Error loss together helps in understanding the errors from a different perspective [12,43]. In Figure 6, the area under the Mean Square Error loss is smaller than the area covered by the Mean Absolute Error loss, which follows from the behavior of the square function for small residuals.

Mean Square Error Loss & Huber Loss
Figure 7 indicates the plot of Mean Square Error loss and Huber loss. The representation of the Mean Square Error loss plotted with the Mean Absolute loss and Huber loss looks similar to that shown in Figures 8 and 9, but the scales used in the graphs differ; the details can be seen in the relative error loss in Figure 7. Figure 7 also indicates increased smoothness for the chosen value of delta [3,8]. Many variants can be studied by varying the delta value. For a constant value of 0.5, the plot shows the minimum close to the equilibrium, which demonstrates that no further training or backpropagation is needed to reach the minimum separately [44].

Mean Absolute Error Loss & Huber Loss
Figure 8 represents the Mean Absolute Error and Huber losses. Plotting them together helps in depicting the minimum loss at the intersection of the two graphs.

Mean Square Error Loss, Mean Absolute Error Loss & Huber Loss
Figure 9 represents the Huber loss function together with the Mean Square Error and Mean Absolute Error losses; the smoothness of the Huber loss curve can be seen alongside them [45]. This indicates that few outliers are present. The area under the curve is smallest for MSE and largest for MAE, with the Huber loss function (represented in orange) between the two [29,46].

Conclusions
The FERMO algorithm delivers an accuracy higher than all the individual regression algorithms and their combined versions, and considerably higher than the approaches considered for performance comparison. This enhancement comes from the two unique features of the algorithm: (i) load distribution amongst the regression algorithms and (ii) the identification of the weight factor. The validation mechanism explores all possible forms of regression loss; any linear or non-linear losses are highlighted by this loss analysis. The minimum loss incurred in the regression process indicates maximum accuracy. The experiment shows that the accuracy is acceptable. The accuracy could be improved further by minimizing the precision loss, but additional training cannot be employed, since the model would then suffer from overtraining and the values would move away from the gradient-descent minimum. Hence, balancing factors are used instead of further training the model.

Figure 1 .
Figure 1.System prototype for content modeling.


Figure 6 .
Figure 6.Mean Square Error loss & Mean Absolute Error loss.


Figure 9 .
Figure 9. Mean Square Error loss, Mean Absolute Error loss & Huber loss.

Table 1 .
Summary of Missing Values.


Table 3 .
Feature analysis of rotten tomatoes movie review (Data Source: Kaggle.com, accessed on 15 November 2021).

Table 4 .
Statistical analysis of features of rotten tomatoes movie review.
Step 4: Testing the dataset. For a test sample x_test = [x_1, x_2, x_3, ..., x_n], the predicted value ŷ_i for the i-th sample is obtained as ŷ_i = Σ_{j=1}^{k} θ_j ŷ_{i,j}.