Peer-Review Record

A Hybrid Model for the Measurement of the Similarity between Twitter Profiles

Sustainability 2022, 14(9), 4909; https://doi.org/10.3390/su14094909
by Niloufar Shoeibi 1,*, Nastaran Shoeibi 2, Pablo Chamoso 1, Zakieh Alizadehsani 1 and Juan Manuel Corchado 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 9 February 2022 / Revised: 12 April 2022 / Accepted: 14 April 2022 / Published: 19 April 2022

Round 1

Reviewer 1 Report

The authors study the similarity measurement of Twitter profiles. I have some comments and questions; addressing them might improve the quality of this research.

1- This first issue needs to be addressed at the last stage; however, it comes first in my comments. The authors need to carefully read the paper for typos and grammatical errors and also correct the presentation of some sentences. The following are just a few examples.

  • P.2, real-time[9] --> add a space between the word and the reference.
  • P.2, "... engagement, so the aim is to..." --> unclear sentence; whose aim?
  • P.2, "... For certain behaviors, while for others no...." --> unclear sentence; needs a rewrite.
  • Caption of Figure 3, "the similar profiles and the not similar ones" --> rewrite the sentence.

2- The readability of this paper needs to be improved. The length is about 19 pages, while the needed content could be presented more concisely. As an example, the whole Introduction section provides about 3 pages of text, although the needed information is about 1 page. For instance:

  • the first few paragraphs explaining social media;
  • repeating the statement that "we are going to apply the method on Twitter" several times throughout the paper.

There are many examples of this; however, I think the authors need to clearly present the following without further unnecessary discussion:

  • The problem definition (which is similarity measurement)
  • The reason for the research (which is indicated, for instance, in [5-7])
  • The general methodology
  • The difference from existing methods (very important; this part needs to be highlighted as the novelty of this paper)
  • A clear but as summarized as possible explanation of the proposed method (in case the authors find everything necessary, some technical explanations could be presented in an appendix)
  • Comparison results (with a few selected existing methods that are discussed earlier)

3- This comment is related to comment 2. The authors need to clearly state the novelties in the last paragraphs of the introduction.

4- It would be better to add a comparison table in Section 2 (literature review) to show the advantages and disadvantages of existing methods.

5- The authors explain their own method clearly; however, they need to compare the output results with some of the recent methods reviewed in the related works.

6- The quality of the figures needs to improve, for instance Figures 3 and 5. The authors also need to use proper legends with proper markers to distinguish the results even in black-and-white prints.

7- All figures should be needed and analyzed. For instance, the results presented in Figure 3 are just two numbers; why use a figure to show two numbers?

 

Author Response

Dear Reviewer,

Thank you very much for your insightful comments. We appreciate your time and consideration. We hope the new version of the paper and the changes make the article suitable for publication. Below, some changes are highlighted.

 

  • The English of the paper has been revised and corrected using Grammarly Premium. If further changes are needed, it will be sent to a native English speaker, but that revision is not ready yet.
  • The minor mistakes are fixed.
  • The readability of the paper is improved.
  • The novelty of the proposed method is in defining a new distance metric for the similarity measurement of Twitter profiles. Our proposed hybrid model covers similarity from three aspects: Graph of Audience, Character Computation & Behavioral Measurement, and Content Similarity. Apart from considering a more comprehensive range of information, the novelty of the work is in the features considered in the model.
  • The comparison of the proposed model with the state of the art is depicted in Table 1.
  • Also, a quantitative evaluation of the system is presented in the Case Study section.
  • All the figures are updated, and the state of the art is corrected entirely.

 

We hope you find the improvements satisfactory. Once again, thank you very much.

Kind regards

Reviewer 2 Report

Main comments

  • Similarity analysis in text is a big area in NLP, and the review in Section 2 (Page 4) has a very superficial touch on this concept. There are so many other works in the domain, including Latent Semantic Analysis, Topic Modelling Analysis, Short-Text similarity analysis, embedding-based analysis (word and sentence), terminology-based analysis (such as WordNet similarity measures using LCH, Wu-Palmer, Path similarity) and the like that need to be discussed in this section. I am not convinced that Euclidean distance, the Pearson correlation coefficient, and Spearman’s rank correlation are even related to this concept.

  • What is the relationship between the paper and text classification that you start to discuss in the middle of Page 4? You need to justify and set the scene for text classification before you jump into giving some related work on ANN-based text classification.

  • Below Algorithm 1, you mention “This architecture is optimized”. Optimized models have specific meanings in accordance with specific optimization measures and algorithms. What have you optimized the model for (what features) and using what optimization technique? Or, remove the term optimized and use “efficient” instead.

  • Section 3.1: Why are 3400 tweets extracted per user? What is this limit or set number for, and why not less or more? Table 1 mentions 3240 instead! It is best to limit by the number of days rather than the number of tweets. What are the Twitter API limitations you refer to (specify and explain how they are relevant)?

  • What is the basis for Figure 2? Is there a psychological study that backs it? If yes, cite the work. If not, you need to explain how this model works and how you have come up with the model. The caption of the figure is very generic and it does not give any detailed information.

  • Table 1 should contain ALL and not just SOME of the features collected. Why do you refer to these features as Advanced? What are the basic features and why are these ones advanced and how? Also, Table 1 features require some justification as to why these features are important and why they have been chosen. For instance, why can the number of mentions make users similar to each other?

  • The caption for Table 2 should be more specific and accurate. What are the numbers in the table? How should we interpret the numbers? The smaller the better? This requires a better definition of the DTW algorithm used.

  • I am a little confused between Table 3 and the Jaccard measure. What is the latter for? Also, what happens after we know the data in Table 3 for a set of users? What are the weights in Table 3 used for and how does the Jaccard measure come into the picture? This section needs clarification.

  • Under Section 3.4, you should clarify how tweet contents are prepared per user. By that, I mean do you actually put all the tweets by a user into one document and compare two documents by two users? Or, do you compare tweets individually? If the latter, how do you come up with an overall textual similarity of users?

  • In the TF-IDF model, do you create a vector per tweet? Or, do you create a vector per a collection of tweets by a user? What is the document collection here? Details missing.

  • Figure 3 has no subtitles for each of the 4 subplots. What are they? What is on the x-axis?

  • 100 politicians and 100 singers means a 100x100 matrix that has 10,000 cells. Why is 20,000 mentioned on Page 14? Clarify.

  • Table 5: What were the input features (how do you combine the three types of features)? How many data instances did you have (10K or 20K)? Clarify again where you explain the table. Also, for comparison purposes, I suggest the authors include the same classification performance set achieved using DistilBERT.

  • Your explanations end after the classification in Table 5. The question remains how you would then answer your research question of how to find similar or duplicate profiles. Do you then use the classification scores for deciding if two profiles are duplicates? If so, what would be your cut-off classification score for that decision?

  • You seem to have used the hold-out validation technique for validating the classification techniques used. What was the train-test ratio? Why did you not use more advanced validation strategies such as k-fold or bootstrapping techniques? The hold-out technique can be very much biased towards lucky or unlucky splits for any technique.

  • One last but major concern I have with this work is the ethics around the analysis of Twitter users and their similarity with others. How have the ethical aspects been analysed or alleviated? I understand you have made use of politician and singer profiles, but surely there are ethical concerns when you analyse any user’s behavioural patterns on social media.

 

Minor comments

  • The abstract reads as convoluted at the end. What is the 97.24% accuracy on? What is a “duplicate profile”? You need to better and more completely define this term at the end of Page 1. This is particularly important as this paper relies on duplicate profiles.

  • By Context Similarity, I think the authors mean Content Similarity. Please revise.

  • Do not use “etc.”; be specific or remove etc. throughout the paper. Do not use “many other things”; be specific or remove it.

  • At the end of Page 1, is detecting duplicate profiles a malicious activity? You need to change the wording.

  • Please refrain from using terms such as “easy” (see Section 3.4 for instance).

  • Give a citation for DistilBERT where you mention it the first time in the body of the paper.

  • Why do you need the first phrase under section 2? “Many works have been done on data analysis [21],” the phrase and the reference are redundant and not relevant.

  • Section 3: Change phrase “in a very general way” at the end of the first paragraph under this section to something that reads like “an overview of the method is given” or something similar.

  • Instead of Algorithm 1, which I do not see as an algorithm, it is best to use a diagram that illustrates the steps of the proposed method. You have this in Figure 1 and I believe Algorithm 1 is redundant and should be removed.

  • Figure 1 has two completely identical parts for two identical users. My suggestion is to remove one side completely and instead of one user at the top, use two users as input at the top. Also, do you keep the similarity measures in a DB as well? What is the purpose of keeping similarities?

  • The first paragraph of Section 3.1 is redundant.

  • Algorithm 2 not needed, this is not your work and it is based on an existing algorithm.

  • Section 3.4.1: No need to go through the literature again here, you have done it in the Related Work section already.

  • No need to explain Cosine Sim. It is well known in the literature (Page 14).

  • No need to say what year a paper was published (Page 14, the second last paragraph).

  • What is the exact number of US politicians? Do not use almost and be specific on Page 14.

  • Figure 4 does not show anything of significance or importance while you have already mentioned the numbers. The figure should be removed.

  • The English of the paper needs a full revision. There are a lot of grammatical errors in the article.

Author Response

Dear Reviewer,

Thank you very much for your insightful comments. We are delighted to receive them and appreciate your time and consideration. We hope the new version of the paper and the changes make the article suitable for publication. Below, we answer your questions:

 

  • The English of the paper has been revised and corrected using Grammarly Premium. If further changes are needed, it will be sent to a native English speaker, but that revision is not ready yet.
  • The novelty of the proposed method is in defining a new distance metric for the similarity measurement of Twitter profiles. Our proposed hybrid model covers similarity from three aspects: Graph of Audience, Character Computation & Behavioral Measurement, and Content Similarity. Apart from considering a more comprehensive range of information, the novelty of the work is in the features considered in the model.
  • The comparison of the proposed model with the state of the art is depicted in Table 1.
  • The minor mistakes are fixed.
  • The readability of the paper is improved.
  • The state-of-the-art is updated entirely. 
  • The algorithm is efficient, not optimized. Thank you for the tip.
  • There was a typo. The number of tweets extracted using the official Twitter APIs is 3200, a limit set by Twitter's developer team. It is explained in detail in the Data Extraction component.
  • Figure 2 now references the source article.
  • Table 3 is an example of the graph of the audience, which represents the scenario above. A figure is added to the right side of the table for clarification, and more explanations are added. The weights are the number of times the source node has mentioned the target node. The directed graph is turned into a set of the user's audience for applying Jaccard similarity.
  • The tweets are treated individually to check how many tweets are exactly the same, and they are combined into a single document for content similarity measurement.
  • There are 100 politicians and 100 singers; the number of possible pairs of two out of these 200 users is 19,900.
  • The input features of the classification in the case study are the similarity measurements calculated from the proposed model.
  • Also, the result of DistilBERT is added to the evaluation table.
  • By considering the proposed model as a more comprehensive distance measurement method, the duplicate profiles are the users with a very short distance.
  • The dataset has been divided into training and test datasets using a stratified train-test split to select an evenhanded number of samples of each category (i.e., similar, not similar) and keep the train and test sets balanced and fair.
  • From an ethical aspect, analyzing the public information of Twitter users is allowed, and Twitter encourages researchers to investigate Twitter users and content.

 

We hope you find the improvements satisfactory.

Once again, thank you very much.

Kind regards

 

Round 2

Reviewer 1 Report

The authors have revised the paper and answered most of the mentioned comments. 

Author Response

Dear Reviewer,

Thank you very much for your insightful comments. We appreciate your time and consideration. We hope the new version of the paper and the changes make the article suitable for publication. Below, the responses to your comments are provided.

Authors’ response to comment 1:

The English of the paper has been revised and corrected by a native English speaker. The minor mistakes have been fixed, and the readability of the paper has been improved.

Authors’ response to comment 2:

We have reviewed the paper, reduced the redundancy of the information, and removed the unnecessary parts.

  • The problem definition (a hybrid model for similarity measurement of Twitter profiles by considering a complex distance metric).

 

  • The reason for the research (research questions are defined on pages 2 to 3).

 

  • The general methodology (in the Proposed Method section, pages 5 to 12).

 

  • The difference from existing methods (The novelty of the proposed method is in defining a new distance metric for the similarity measurement of Twitter profiles. Our proposed hybrid model covers similarity from three aspects: Graph of Audience, Character Computation & Behavioral Measurement, and Content Similarity. Apart from considering a more comprehensive range of information, the novelty of the work is in the features considered in the model. On page 5, in the state of the art, in Table 1, it has been compared with other existing methods, demonstrating that the proposed system considers a broader range of information and more features.)

 

  • Explaining the proposed method clearly, but as summarized as possible (the method’s explanation is rewritten, aiming to increase the clarity of the idea).

 

  • Comparison results (a comparison between the proposed method and the state of the art is summarized in Table 1 on page 5. Moreover, in the Case Study section on page 15, a quantitative comparison of the model's evaluation is presented).

 

Authors’ response to comment 3:

The novelty of the proposed method is in defining a new distance metric to use for the similarity measurement of Twitter profiles. Our proposed hybrid model covers the similarity from three aspects, Graph of Audience, Character Computation & Behavioral Measurement, and Content Similarity. Apart from considering a more comprehensive range of information, the novelty of the work is the features considered in the model.

Authors’ response to comment 4: 

Thank you very much for your comment; it has been added as you proposed.

Authors’ response to comment 5:

Table 1 shows that the existing methods reviewed in the related works do not consider all three aspects covered by the proposed methodology. This means that the broader range of information and features together define a new distance metric that can be used as a quantitative measure of how similar two profiles are.

Authors’ response to comment 6:

Thank you very much for your thoughtful comment. All the figures have been regenerated, and we have attempted to increase their clarity by reducing their complexity.

Authors’ response to comment 7:

This figure is deleted to reduce the complexity of the article, but the quantitative evaluations are added to Table 6 in the case study, page 15.

 

Once again, thank you very much for your thoughtful comments and guidance. We believe that your advice helped us to improve the article very much. We hope that you find the changes satisfactory.

Kind regards

Author Response File: Author Response.docx

Reviewer 2 Report

The response letter is not acceptable. The authors need to respond to each of the comments I have made in my review and answer them individually and clearly with a format like:

Reviewer's comment 1...

Authors' response...

Reviewer's comment 2...

Authors' response...

 

Author Response

Dear Reviewer,

Thank you very much for your insightful comments. We are delighted to receive them and appreciate your time and consideration. We hope the new version of the paper and the changes make the article suitable for publication. Below, we answer your comments and clarify the changes.

Responses to the main comments:

Reviewer’s Comment 1:

Similarity analysis in text is a big area in NLP, and the review in Section 2 (Page 4) has a very superficial touch on this concept. There are so many other works in the domain, including Latent Semantic Analysis, Topic Modelling Analysis, Short-Text similarity analysis, embedding-based analysis (word and sentence), terminology-based analysis (such as WordNet similarity measures using LCH, Wu-Palmer, Path similarity) and the like that need to be discussed in this section. I am not convinced that Euclidean distance, the Pearson correlation coefficient, and Spearman’s rank correlation are even related to this concept.

Authors' response to comment 1:

Thank you for your comment. The related work is updated entirely, and in Table 1 the proposed method is compared to recent state-of-the-art works.

 

Reviewer’s Comment 2:

What is the relationship between the paper and text classification that you start to discuss in the middle of Page 4? You need to justify and set the scene for text classification before you jump into giving some related work on ANN-based text classification.

Authors' response to comment 2:

We had the similarity of the text feature extraction part in mind; however, to increase the article's clarity, the related work section has been rewritten, and we believe the update has made the paper clearer. Thank you for the suggestion.

 

Reviewer’s Comment 3:

Below Algorithm 1, you mention “This architecture is optimized”. Optimized models have specific meanings in accordance with specific optimization measures and algorithms. What have you optimized the model for (what features) and using what optimization technique? Or, remove the term optimized and use “efficient” instead.

Authors' response to comment 3:

The algorithm is efficient, not optimized. Thank you for the tip.

 

Reviewer’s Comment 4:

Section 3.1: Why are 3400 tweets extracted per user? What is this limit or set number for, and why not less or more? Table 1 mentions 3240 instead! It is best to limit by the number of days rather than the number of tweets. What are the Twitter API limitations you refer to (specify and explain how they are relevant)?

Authors' response to comment 4:

There was a typo. The number of tweets extracted using the official Twitter APIs is 3200, a limit set by Twitter's developer team. It is explained in detail in the Data Extraction component [pages 7 and 8].

 

Reviewer’s Comment 5:

What is the basis for Figure 2? Is there a psychological study that backs it? If yes, cite the work. If not, you need to explain how this model works and how you have come up with the model. The caption of the figure is very generic and it does not give any detailed information.

Authors' response to comment 5:

Figure 2 is referenced to the source article.

 

Reviewer’s Comment 6:

Table 1 should contain ALL and not just SOME of the features collected. Why do you refer to these features as Advanced? What are the basic features and why are these ones advanced and how? Also, Table 1 features require some justification as to why these features are important and why they have been chosen. For instance, why can the number of mentions make users similar to each other?

Authors' response to comment 6:

There was a typo in the caption of Table 2. When extracting Twitter data using the official Twitter APIs, which is the ethical way to do it, the information of each tweet is saved in a JSON file called a Tweet object. The Tweet object holds the information of the tweet: its text, the time the tweet was made, the geo-location where the tweet was created, the number of interactions, and also information about the user, like the number of followings/followers, user bio, name and last name, geo-location, and so on. This is the primary raw information that we have from each tweet. Now imagine having all of this information over time for a profile's timeline. We can calculate the advanced features from the raw data by considering activity ratios over time: the number of tweets per hour/day, the number of retweets per hour/day, the number of mentions per hour/day, and so on. (These features are presented in Table 2.)
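To illustrate how such calculated features can be derived from the raw timestamps, here is a minimal sketch (assuming pandas; the column names and values are hypothetical stand-ins, not the paper's actual schema):

```python
import pandas as pd

# Hypothetical raw timeline: one row per tweet, as parsed from Tweet objects.
timeline = pd.DataFrame({
    "created_at": pd.to_datetime(["2022-01-01 10:00",
                                  "2022-01-01 11:30",
                                  "2022-01-02 09:15"]),
    "is_retweet": [False, True, False],
})

# Calculated ("advanced") features: activity counts aggregated per day.
tweets_per_day = timeline.resample("D", on="created_at").size()
retweets_per_day = (timeline[timeline["is_retweet"]]
                    .resample("D", on="created_at").size())
print(tweets_per_day)   # 2022-01-01: 2, 2022-01-02: 1
```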

 

Reviewer’s Comment 7:

The caption for Table 2 should be more specific and accurate. What are the numbers in the table? How should we interpret the numbers? The smaller the better? This requires a better definition of the DTW algorithm used.

Authors' response to comment 7:

The results of applying the Dynamic Time Warping algorithm have been presented in Table 3. As the output of DTW represents a distance, a lower distance indicates a higher similarity level. [Page 9]
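For clarity, here is a minimal sketch of a DTW distance computed over two users' activity series; this is the textbook dynamic-programming formulation, not necessarily the paper's exact implementation, and the series are hypothetical hourly tweet counts:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)  # cost[i, j]: best alignment of a[:i], b[:j]
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# Hypothetical hourly tweet counts for two users; a lower value means
# more similar activity patterns.
user_a = np.array([0, 2, 5, 3, 0, 1])
user_b = np.array([0, 1, 4, 4, 1, 0])
print(dtw_distance(user_a, user_b))
```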

 

Reviewer’s Comment 8:

I am a little confused between Table 3 and the Jaccard measure. What is the latter for? Also, what happens after we know the data in Table 3 for a set of users? What are the weights in Table 3 used for and how does the Jaccard measure come into the picture? This section needs clarification.

Authors' response to comment 8:

Table 4 is an example of the graph of the audience, which represents the scenario above. A figure is added to the right side of the table for clarification, and more explanations are added. The weights are the number of times the source node has mentioned the target node. The directed graph is turned into a set of the user's audience for applying Jaccard similarity.

 

 

 

Reviewer’s Comment 9:

Under Section 3.4, you should clarify how tweet contents are prepared per user. By that, I mean do you actually put all the tweets by a user into one document and compare two documents by two users? Or, do you compare tweets individually? If the latter, how do you come up with an overall textual similarity of users?

Authors' response to comment 9:

The tweets are treated individually to check how many tweets are exactly the same, and they are combined into a single document for content similarity measurement.

 

Reviewer’s Comment 10:

In the TF-IDF model, do you create a vector per tweet? Or, do you create a vector per a collection of tweets by a user? What is the document collection here? Details missing.

Authors' response to comment 10:

TF-IDF is applied to a single document per user, created by appending all of the user's tweets together.
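A minimal sketch of this setup (assuming scikit-learn; the two per-user documents below are synthetic stand-ins):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical per-user documents: all of a user's tweets appended into one string.
doc_user_1 = " ".join(["the election results today", "vote in the election"])
doc_user_2 = " ".join(["new single out now", "tour dates announced"])

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([doc_user_1, doc_user_2])  # collection of N=2 docs
print(cosine_similarity(vectors[0], vectors[1])[0, 0])        # content similarity
```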

 

Reviewer’s Comment 11:

Figure 3 has no subtitles for each of the 4 subplots. What are they? What is on the x-axis?

Authors' response to comment 11:

This figure is removed aiming to increase the clarity of the paper.

 

Reviewer’s Comment 12:

100 politicians and 100 singers means a 100x100 matrix that has 10,000 cells. Why is 20,000 mentioned on Page 14? Clarify.

Authors' response to comment 12:

There are 100 politicians and 100 singers; the number of possible pairs of two out of these 200 users is C(200, 2) = (200 × 199) / 2 = 19,900.

 

Reviewer’s Comment 13:

Table 5: What were the input features (how do you combine the three types of features)? How many data instances did you have (10K or 20K)? Clarify again where you explain the table. Also, for comparison purposes, I suggest the authors include the same classification performance set achieved using DistilBERT.

Authors' response to comment 13:

The input features are the distance features calculated from our proposed model (from the three aspects of behavior and characteristics, network of audience, and the text of the tweets) and each profile's information that exists in Tweet objects, like the number of followers, the number of followings, the number of listed, and the number of posts. So, there are features related to the profile and the distance metrics calculated from our proposed model for each user. Thank you for your advice; the result of DistilBERT has also been added to Table 6, page 15.
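As an illustration of this setup, a minimal sketch with random stand-in data (the feature layout is an assumption for illustration, not the paper's exact matrix):

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in per-pair features: e.g., the three model distances (behavioral DTW,
# audience Jaccard, content similarity) plus raw profile counts.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = np.array([0, 1] * 100)   # 1 = similar pair, 0 = not similar

clf = SVC().fit(X, y)        # the same X, y would feed KNN or Random Forest
print(clf.predict(X[:5]))
```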

 

Reviewer’s Comment 14:

Your explanations end after the classification in Table 5. The question remains how you would then answer your research question of how to find similar or duplicate profiles. Do you then use the classification scores for deciding if two profiles are duplicates? If so, what would be your cut-off classification score for that decision?

Authors' response to comment 14:

By considering the proposed model as a more comprehensive distance measurement method, the duplicate profiles are the users with a very short distance.

 

Reviewer’s Comment 15:

You seem to have used the hold-out validation technique for validating the classification techniques used. What was the train-test ratio? Why did you not use more advanced validation strategies such as k-fold or bootstrapping techniques? The hold-out technique can be very much biased towards lucky or unlucky splits for any technique.

 

 

 

Authors' response to comment 15:

The dataset has been divided into training and test datasets (75% for training and 25% for testing) using a stratified train-test split to select an evenhanded number of samples of each category (i.e., similar, not similar) and keep the train and test sets balanced and fair.
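For reference, a minimal scikit-learn sketch of such a stratified hold-out split, with random stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 5)   # stand-in per-pair feature matrix
y = np.array([0, 1] * 100)   # stand-in similar / not-similar labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # 75% train / 25% test
    stratify=y,       # preserve the class ratio in both splits
    random_state=42,
)
print(len(X_train), len(X_test))   # 150 50
```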

 

Reviewer’s Comment 16:

One last but major concern I have with this work is the ethics around the analysis of Twitter users and their similarity with others. How have the ethical aspects been analysed or alleviated? I understand you have made use of politician and singer profiles, but surely there are ethical concerns when you analyse any user’s behavioural patterns on social media.

Authors' response to comment 16:

From an ethical aspect, analyzing the public information of Twitter users is allowed, and Twitter encourages researchers to investigate Twitter users and content.

 

 

Responses to the minor comments:

Thank you very much for your sharp comments; all the minor suggestions are applied.

 

Reviewer’s Comment 1: The abstract reads as convoluted at the end. What is the 97.24% accuracy on? What is a “duplicate profile”? You need to better and more completely define this term at the end of Page 1. This is particularly important as this paper relies on duplicate profiles.

Authors' response to comment 1: More explanations are added. However, the paper's aim is not only finding duplicate profiles but also proposing a new distance measurement to calculate the similarity of two users. Finding duplicate profiles, which are the users with the same characteristic behavior, graph of audience, or content, is one application of this measurement.

 

Reviewer’s Comment 2: By Context Similarity, I think the authors mean Content Similarity. Please revise.

Authors' response to comment 2: Content Similarity it is! Thanks!

 

Reviewer’s Comment 3: Do not use “etc.”, be specific or remove etc. throughout the paper. Do not use “many other things”, be specific or remove it.

Authors' response to comment 3: Corrected.

 

Reviewer’s Comment 4: At the end of Page 1, is detecting duplicate profiles a malicious activity? You need to change the wording.

Authors' response to comment 4: Corrected.

 

Reviewer’s Comment 5: Please refrain from using terms such as “easy” (see Section 3.4 for instance).

Authors' response to comment 5: Corrected.

 

Reviewer’s Comment 6: Give a citation for DistilBERT where you mention it the first time in the body of the paper.

Authors' response to comment 6: Reference is added.

 

Reviewer’s Comment 7: Why do you need the first phrase under section 2? “Many works have been done on data analysis [21],” the phrase and the reference are redundant and not relevant.

Authors' response to comment 7: The paragraph is removed and the related work is rewritten.

Reviewer’s Comment 8: Section 3: Change phrase “in a very general way” at the end of the first paragraph under this section to something that reads like “an overview of the method is given” or something similar.

Authors' response to comment 8: Corrected.

 

Reviewer’s Comment 9: Instead of Algorithm 1, which I do not see as an algorithm, it is best to use a diagram that illustrates the steps of the proposed method. You have this in Figure 1 and I believe Algorithm 1 is redundant and should be removed.

Authors' response to comment 9: Algorithm 1 is rewritten and fixed.

 

Reviewer’s Comment 10: Figure 1 has two completely identical parts for two identical users. My suggestion is to remove one side completely and instead of one user at the top, use two users as input at the top. Also, do you keep the similarity measures in a DB as well? What is the purpose of keeping similarities?

Authors' response to comment 10:

Thank you very much for this suggestion! The figure is redone, and it is much clearer and more comprehensible this way. It is not necessary to keep the data in the MongoDB database; we can keep the data in the database when the number of users to query is high.

 

Reviewer’s Comment 11: The first paragraph of Section 3.1 is redundant.

Authors' response to comment 11: Removed.

 

 

Reviewer’s Comment 12: Algorithm 2 not needed, this is not your work and it is based on an existing algorithm.

Authors' response to comment 12: Removed.

 

Reviewer’s Comment 13: Section 3.4.1: No need to go through the literature again here, you have done it in the Related Work section already.

Authors' response to comment 13: Removed.

 

 

Reviewer’s Comment 14: No need to explain Cosine Sim. It is well known in the literature (Page 14).

Authors' response to comment 14: Removed.

 

Reviewer’s Comment 15: No need to say what year a paper was published (Page 14, the second last paragraph).

Authors' response to comment 15: Removed.

 

Reviewer’s Comment 16: What is the exact number of US politicians? Do not use almost and be specific on Page 14.

Authors' response to comment 16: Typo, fixed.

 

Reviewer’s Comment 17: Figure 4 does not show anything of significance or importance while you have already mentioned the numbers. The figure should be removed.

Authors' response to comment 17: Removed.

 

Reviewer’s Comment 18: The English of the paper needs a full revision. There are a lot of grammatical errors in the article.

Authors' response to comment 18: The English of the paper has been revised and corrected by a native English speaker. The minor mistakes have been fixed, and the readability of the paper has been improved.

 

Once again, thank you very much for your thoughtful comments and guidance. We are glad that you reviewed our paper, and we believe that your advice helped us to improve the article very much. We hope that you find the changes satisfactory.

Kind regards

 

 

Author Response File: Author Response.pdf

Round 3

Reviewer 2 Report

  • Section 2 is in a better form now; however, it still lacks cohesion. This section should summarize related work on Twitter profile analysis for similarity and/or duplication. The focus should be on the different approaches used in the literature to find similar or duplicate Twitter users and on how similar users are identified on Twitter (semantic or rule-based text analysis, graph-based, and the similar). But instead, the section has works that are not directly on this topic, like reference 24 and reference 25. Also, in my opinion, the second last paragraph on Page 4 should become the second paragraph of Section 2, as this is where there is some motivation for the current work.

  • LDA needs a citation on Page 4.

  • BERT needs a citation on Page 5.

  • Why is reference 25 relevant?

  • Page 4: P4 and P6 should go with each other. Reference 27 is now in between the two, and it is not relevant to semantic analysis.

  • You should only use the full term Natural Language Processing (NLP) once, in the first occurrence, and then use NLP only. See Page 4 and fix similar cases.

  • Do not capitalize “Natural language”; use “natural language”.

  • Page 4, Chandrasekaran et al. This should be followed by a citation.

  • Page 4: Park et al. presented a cosine similarity-based methodology to enhance the performance. The performance of what?

  • There are a lot of punctuation errors in the current manuscript that need to be fixed, including wrong and unnecessary capitalization of words in the middle of sentences.

  • Please remove words such as “understandably” and “Etc” from the paper.

  • Page 5: Peinelt et al. proposed a unique topic-informed BERT-based structure for pairwise semantic similarity detection. Between what text pairs?

  • I cannot find where Table 1 has been referenced and briefly discussed in the text. Every table and figure needs to be referenced in the text. Also, since this is not a systematic review paper, it is best not to use “the summary of the state-of-the-art” for Table 1 as its caption. Instead, you can use terms such as “a brief summary of some of the most related or most recent studies” or something similar.

  • Figure 2 requires a more detailed caption with a little bit of introduction to the concepts that are visualized.

  • The response to my comment 6 of the previous revision needs to come in the paper. Please note when there is a review question, it is for the paper’s clarity not for the reviewer’s information. Please add the description of basic versus advanced features in the paper. Instead of advanced features, you may also use “calculated features” and justify their significance in your study and say why you are calculating them.

  • Table 3 headings should have Posted after the nouns, like Tweets Posted instead of Posted Tweets. Or, you can remove Posted.

  • Check the caption of Figure 3 and give an insightful caption with some detail.

  • In the paragraph above Table 4 where the sentence starts with “It is comprehensive…”, the term “is comprehensive” should be replaced by “should be noted” or a similar phrase.

  • I am still not sure what the output measure of section 3.3 is in terms of user similarities. I understand Table 4; however, it is not clear to me how the Jaccard measure is used here. I can see new text added that says there are two values that are calculated. But, Jaccard is applied on what sets and how are those sets extracted? Are there two separate feature values that are created in section 3.3 per pair of users? Or just one? I suggest this section be written again with a clear definition of the two separate (if two) similarity measures that are calculated.

  • In relation to the previous comment on TF-IDF (comment 10), now the question is what is the document collection here to create TF-IDF weights if you have one document per user (tweets put together in one single document)? Basically, you need to describe how the vocabulary is created for TF-IDF vectors and then what is the reference document collection so the IDF weights can be created. I assume the two documents (one per user) should be your collection and the source of the vocabulary but this needs to become clear in your paper. I am assuming N=2; or, N=the number of users you choose to monitor on Twitter and then the document collection is the set of documents for all those users over a specific period of time. Please clarify.

  • Comment 12 on the set of pairs is fine now as you are choosing singer-singer and politician-politician pairs as well.

  • Thanks for adding information to Table 6 now. There is an inconsistency as to why only SVM+DistilBERT has been added and not the other two methods, KNN+DistilBERT and RF+DistilBERT. Also, check the caption now and add DistilBERT to it too.

  • In the caption of Figure 4, you can introduce the names of the three users (you have them in the text).

  • I have some concerns about the training data. It is mentioned in the paper that pairs of politicians and pairs of singers are labelled as similar; hence, a balanced data set of 19,900 pairs, 50% similar. To make sure that this is the right training set, the paper should give summary statistics of the values of all of the features used in the study per class (like the avg and std of feature values for class similar versus class dissimilar) so we know the distributions of features over the two classes. This is because you are making the assumption that singers are similar to each other and politicians are similar to politicians. You should also note that politicians vs. singers may be two very different groups of Twitter users. In the case of less apparent differences (like between singers and models or between politicians and company CEOs) your classification approach may not perform as well. This can be part of future work.

  • My previous comment 14 is not fully answered in the paper. How will you use the classification outcome for the purpose of duplicate or similar user identification? In a real scenario, you have two users on Twitter and you are trying to see whether they are similar/same. How will your trained classification model do this? Please answer by adding some details in the paper itself.

  • Provide train-test ratios of the hold-out validation technique in the paper (75%-25% as per your previous response).

  • Not sure how reference 58 (on smart cities) is relevant to this work.

  • Working with Twitter data requires specific Ethics clearance/approval. Do the authors have such approval? This needs to be acknowledged.

 

Author Response

Dear Reviewer,

We would like to kindly thank you for your time and all your thoughtful comments. Below we provide the response to each of your comments.

  1. Section 2 is in a better form now; however, it still lacks cohesion. This section should summarize related work on Twitter profile analysis for similarity and/or duplication. The focus should be on the different approaches used in the literature to find similar or duplicate Twitter users and on how similar users are identified on Twitter (semantic or rule-based text analysis, graph-based, and the similar). But instead, the section has works that are not directly on this topic, like reference 24 and reference 25. Also, in my opinion, the second last paragraph on Page 4 should become the second paragraph of Section 2, as this is where there is some motivation for the current work.

Authors' response to comment 1:

Thank you for your comment. The recommended changes are applied. Reference 24, which was not directly related, is removed, and the relevance of reference 25 is added to the paper.

  2. LDA needs a citation on Page 4.

Authors' response to comment 2:

Done.

  3. BERT needs a citation on Page 5.

Authors' response to comment 3:

Done.

  4. Why is reference 25 relevant?

Authors' response to comment 4:

In reference 26 (page 4, paragraph 3), the authors propose a novel recommender system that does not seem related to our proposed model at first glance. In fact, it is related because it considers Twitter data as input, finds the users' preferences from their Twitter content, and compares users with each other from the audience-network and content points of view, making it related to our proposed system.

  5. Page 4: P4 and P6 should go with each other. Reference 27 is now in between the two, and it is not relevant to semantic analysis.

Authors' response to comment 5:

Corrected. Thanks.

  6. You should only use the full term Natural Language Processing (NLP) once, in the first occurrence, and then use NLP only. See Page 4 and fix similar cases.

Authors' response to comment 6:

Thanks, all the required changes are applied.

  7. Do not capitalize “Natural language”; use “natural language”.

Authors' response to comment 7

Done. Thanks.

  8. Page 4, Chandrasekaran et al. This should be followed by a citation.

Authors' response to comment 8:

Reference added, thanks.

  9. Page 4: Park et al. presented a cosine similarity-based methodology to enhance the performance. The performance of what?

Authors' response to comment 9:

The required information is added to page 4, paragraph 8: the performance of the text classification models.

  10. There are a lot of punctuation errors in the current manuscript that need to be fixed, including wrong and unnecessary capitalization of words in the middle of sentences.

Authors' response to comment 10:

All are corrected.

  11. Please remove words such as “understandably” and “Etc” from the paper.

Authors' response to comment 11:

All are removed. Thanks.

 

 

  12. Page 5: Peinelt et al. proposed a unique topic-informed BERT-based structure for pairwise semantic similarity detection. Between what text pairs?

Authors' response to comment 12:

On page 5, paragraph 2, it is clarified that the pairwise semantic similarity detection is done between two short-length documents.

  13. I cannot find where Table 1 has been referenced and briefly discussed in the text. Every table and figure needs to be referenced in the text. Also, since this is not a systematic review paper, it is best not to use “the summary of the state-of-the-art” for Table 1 as its caption. Instead, you can use terms such as “a brief summary of some of the most related or most recent studies” or something similar.

Authors' response to comment 13:

Thank you very much. The changes are applied.

  14. Figure 2 requires a more detailed caption with a little bit of introduction to the concepts that are visualized.

Authors' response to comment 14:

The caption of the figure is updated, and more details are added to the text on page 9, paragraph 3. Thanks.

  15. The response to my comment 6 of the previous revision needs to come in the paper. Please note that when there is a review question, it is for the paper’s clarity, not for the reviewer’s information. Please add the description of basic versus advanced features in the paper. Instead of advanced features, you may also use “calculated features” and justify their significance in your study and say why you are calculating them.

Authors' response to comment 15:

Thank you very much for your comment. The explanations are added, and the changes are applied.

  16. Table 3 headings should have Posted after the nouns, like Tweets Posted instead of Posted Tweets. Or, you can remove Posted.

Authors' response to comment 16:

Changes are applied, thanks.

 

  17. Check the caption of Figure 3 and give an insightful caption with some detail.

Authors' response to comment 17:

Details are added; thank you for helping us increase the clarity of the article.

  18. In the paragraph above Table 4, where the sentence starts with “It is comprehensive…”, the term “is comprehensive” should be replaced by “should be noted” or a similar phrase.

Authors' response to comment 18:

Changes are applied. Thank you.

  19. I am still not sure what the output measure of Section 3.3 is in terms of user similarities. I understand Table 4; however, it is not clear to me how the Jaccard measure is used here. I can see new text added that says there are two values that are calculated. But, Jaccard is applied on what sets and how are those sets extracted? Are there two separate feature values that are created in Section 3.3 per pair of users? Or just one? I suggest this section be written again with a clear definition of the two separate (if two) similarity measures that are calculated.

Authors' response to comment 19:

On page 11, in the second paragraph, a paragraph is added under the table and the figure explaining how the profile's audience is transformed into a set. Then the similarity of the sets is measured with Jaccard similarity. Imagine two sets: set_A has 100 users, set_B has 170 users, and their overlap has 70 users. The union of the sets has 100 + 170 - 70 = 200 users, so the Jaccard similarity of these sets is 70/200 = 0.35.
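A minimal sketch reproducing this worked example (the audience sets are synthetic stand-ins):

```python
def jaccard_similarity(audience_a, audience_b):
    """Jaccard similarity of two audience sets: |intersection| / |union|."""
    a, b = set(audience_a), set(audience_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Synthetic sets matching the example: |A| = 100, |B| = 170, overlap = 70.
set_a = set(range(100))        # users 0..99
set_b = set(range(30, 200))    # users 30..199
print(jaccard_similarity(set_a, set_b))   # 70 / 200 = 0.35
```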

  20. In relation to the previous comment on TF-IDF (comment 10), now the question is what is the document collection here to create TF-IDF weights if you have one document per user (tweets put together in one single document)? Basically, you need to describe how the vocabulary is created for TF-IDF vectors and then what is the reference document collection so the IDF weights can be created. I assume the two documents (one per user) should be your collection and the source of the vocabulary, but this needs to become clear in your paper. I am assuming N=2; or, N=the number of users you choose to monitor on Twitter, and then the document collection is the set of documents for all those users over a specific period of time. Please clarify.

Authors' response to comment 20:

On page 13, the explanations are added in the first paragraph of the TF-IDF subsection. Thank you for your thoughtful comment, which increased the clarity of the article.

  21. Comment 12 on the set of pairs is fine now as you are choosing singer-singer and politician-politician pairs as well.
  22. Thanks for adding information to Table 6 now. There is an inconsistency as to why only SVM+DistilBERT has been added and not the other two methods, KNN+DistilBERT and RF+DistilBERT. Also, check the caption now and add DistilBERT to it too.

Authors' response to comment 22:

The suggested information is added to Table 6. This was an excellent comment. Thank you for your recommendation.

  23. In the caption of Figure 4, you can introduce the names of the three users (you have them in the text).

Authors' response to comment 23:

The names are added to the caption of Figure 4.

  24. I have some concerns about the training data. It is mentioned in the paper that pairs of politicians and pairs of singers are labelled as similar; hence, a balanced data set of 19,900 pairs, 50% similar. To make sure that this is the right training set, the paper should give summary statistics of the values of all of the features used in the study per class (like the avg and std of feature values for class similar versus class dissimilar) so we know the distributions of features over the two classes. This is because you are making the assumption that singers are similar to each other and politicians are similar to politicians. You should also note that politicians vs. singers may be two very different groups of Twitter users. In the case of less apparent differences (like between singers and models or between politicians and company CEOs) your classification approach may not perform as well. This can be part of future work.

Authors' response to comment 24:

Thank you for sharing your concern; it is very helpful to have your point of view. Figure X is added to the article, which shows the distribution of the feature values of each class (similar, not similar) converted into 2D using t-SNE dimension reduction. Page 16 holds more information about it.
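For reference, a minimal scikit-learn sketch of such a 2-D t-SNE projection, with random stand-in features:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 10)   # stand-in per-pair feature matrix
emb = TSNE(n_components=2, random_state=42).fit_transform(X)
print(emb.shape)              # (200, 2): one 2-D point per pair, colored by class when plotted
```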

 

 

  25. My previous comment 14 is not fully answered in the paper. How will you use the classification outcome for the purpose of duplicate or similar user identification? In a real scenario, you have two users on Twitter and you are trying to see whether they are similar/same. How will your trained classification model do this? Please answer by adding some details in the paper itself.

Authors' response to comment 25:

Thank you very much for your comment. It has been fully explained in the last paragraph of Section 4, on page 17. This paper proposes a hybrid model for measuring the distance between Twitter profiles. The case study's result indicates that the proposed system is a convenient distance metric for comparing Twitter users. However, the trained classifier does not aim to detect the similarity of all types of users but seeks to show how well the distance metric performs. To classify the similarity level of Twitter profiles with supervised classification models, another dataset with a greater variety of users is needed, or clustering methods based on the proposed distance metric would be helpful, which will be investigated in the future.

  26. Provide train-test ratios of the hold-out validation technique in the paper (75%-25% as per your previous response).

Authors' response to comment 26:

The information is added to the beginning of page 16.

  27. Not sure how reference 58 (on smart cities) is relevant to this work.

Authors' response to comment 27:

Reference 27 shows a use case of an openML platform, Deepint, which is a platform giving insights into the data and the model of any AI system. In the future, we aim to deploy our research on Twitter at the production level; our goal is to build our dashboards using Deepint. So, as you mentioned, the case study may not be directly in the same context, but the tool providing insights is handy. Thank you for sharing your concern with us.

  28. Working with Twitter data requires specific ethics clearance/approval. Do the authors have such approval? This needs to be acknowledged.

Authors' response to comment 28:

This is a fundamental matter for social media analytics, and addressing it is the first step for working in this area. On page 8, in subsection 3.1, the process of acquiring Twitter's permission to investigate Twitter data is fully explained.

Basically, you need to upload your research proposal and supporting documents, and Twitter grants you access if they find your research acceptable. Thank you for sharing your interest in this topic; many works do not pay attention to this issue, but we acquired Twitter's permission and used the official Twitter APIs.

Also, the English of the paper is revised and corrected by a native English speaker.

 

Once again, thank you very much for your thoughtful comments and guidance. We are glad that you reviewed our paper, and we believe that your advice helped us to improve the article very much. We hope that you find the changes satisfactory.

Kind regards

 

 

Author Response File: Author Response.pdf

Round 4

Reviewer 2 Report

At the end of Page 13, it is mentioned that “So the length of the vector in TF-IDF is much smaller than DistilBERT, a fixed size of 768 values”. This may not be true. TF-IDF creates a vocabulary of words that occur within the documents. Depending on the number of tweets that are put into the documents, the length of the vocabulary may grow, and thus the vector size for TF-IDF will also grow. However, with BERT, the length of the input text (X, which you need to specify in the paper) and the embedding size (768) are both fixed. We do not necessarily know which technique has a larger output vector length. Some information should be added to this section, and the justification of performance outputs needs to be adjusted. In fact, because the two documents (one per user's tweets) have a lot of tweets and you are cutting them down to X tokens only, this can cause a huge loss on larger input data for DistilBERT (only the first X tokens/words are seen from a large set of tweets by a user as input to a DistilBERT model) and hence the lower performance. You should add information on DistilBertConfig to the paper, too, so we know, for instance, what X is (see https://huggingface.co/transformers/v2.10.0/model_doc/distilbert.html for some information on the configuration, and you will find “max_position_embeddings” is what I am referring to as X here).

 

Section 3.3 has been improved but it still requires clarification. If Section 3.3 calculates two audience-based similarity measures, just say it. According to Table 4, User 2 and User 1 should be similar because they both have Weight=2 for Target=User 3? The Jaccard part is clear now.

 

In Section 3.4.2, below Equation 4, N should be set to 2 because when you have two Twitter users to compare with each other, N=2 always. Also, df_t will have a value of 1 or 2 again because there are only 2 documents always. Unless, you have more documents and I am not sure how.

 

The text below Table 7 mentions 97.24% accuracy. This is not in line with the format of the results in Table 7, which uses the format 0.00. Please make sure the formats are the same and consistent both in the table and within the text; otherwise, it is not clear where 97.24% is coming from.

 

TSNE is usually abbreviated as t-SNE in the literature. Please correct. Also, why do you have References 57 and 58 for t-SNE? The reference for t-SNE should go back to the previous works, such as van der Maaten, L.J.P.; Hinton, G.E. (Nov 2008). "Visualizing Data Using t-SNE". Journal of Machine Learning Research. 9: 2579–2605.

 

The English of the newly added sentences (highlighted in the latest submission) and other parts of the paper needs attention. The letter cases still need attention throughout the paper as well. See, for instance, twitter in Section 4, the second sentence starting with “and” in the second paragraph under Section 4, the caption of Table 6 (each models optimization should be the optimization of each model) and there are more to be fixed.

 

Please use correct letter casing. If there is an abbreviation, the words should be capitalized. For instance, Page 13, next sentence prediction(NSP) should be Next Sentence Prediction (NSP) with a space after Prediction. Please check all abbreviations in the text.

 

Please make sure there is a space between a word and a parenthesis throughout the text, see Page 13 for instance.

 

Please do not use vague phrases such as “so on” and use specific terms or terminate the sentence (see Page 18).

 

The caption of Table 7 was not fixed after my last comment. It needs to have DistilBERT added to it too.

Author Response

Dear Reviewer,

We would like to kindly thank you for your time and all your thoughtful comments. Below we respond to each of your comments.

  1. At the end of Page 13, it is mentioned that "So the length of the vector in TF-IDF is much smaller than DistilBERT, a fixed size of 768 values". This may not be true. TF-IDF creates a vocabulary of words that occur within the documents. Depending on the number of tweets that are put into the documents, the length of the vocabulary may grow, and thus the vector size for TF-IDF will also grow. However, with BERT, the length of the input text (X, which you need to specify in the paper) and the embedding size (768) are both fixed. We do not necessarily know which technique has a larger output vector length. Some information should be added to this section, and the justification of performance outputs needs to be adjusted. In fact, because the two documents (one per user's tweets) have a lot of tweets and you are cutting them down to X tokens only, this can cause a huge loss on larger input data for DistilBERT (only the first X tokens/words are seen from a large set of tweets by a user as input to a DistilBERT model) and hence the lower performance. You should add information on DistilBertConfig to the paper, too, so we know, for instance, what X is (see https://huggingface.co/transformers/v2.10.0/model_doc/distilbert.html for some information on the configuration, and you will find "max_position_embeddings" is what I am referring to as X here).

Authors' response to comment 1:

Thank you for your comment. The sentence indeed confuses the audience; it is removed from the paper. 
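For reference, the input-length cap the reviewer calls X can be read from the library's default configuration; a minimal sketch, assuming the Hugging Face transformers package:

```python
from transformers import DistilBertConfig

config = DistilBertConfig()             # defaults of distilbert-base-uncased
print(config.max_position_embeddings)   # 512: the input-length cap ("X" above)
print(config.dim)                       # 768: the fixed embedding size
```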

 

  2. Section 3.3 has been improved, but it still requires clarification. If Section 3.3 calculates two audience-based similarity measures, just say it. According to Table 4, User 2 and User 1 should be similar because they both have Weight=2 for Target=User 3? The Jaccard part is clear now.

Authors' response to comment 2:

Required changes are applied. Thank you for your comment.

 

  3. In Section 3.4.2, below Equation 4, N should be set to 2 because when you have two Twitter users to compare with each other, N=2 always. Also, df_t will have a value of 1 or 2 again because there are only 2 documents always. Unless you have more documents, and I am not sure how.

Authors' response to comment 3:

It has been modified, thanks.

 

  4. The text below Table 7 mentions 97.24% accuracy. This is not in line with the format of the results in Table 7, which uses the format 0.00. Please make sure the formats are the same and consistent both in the table and within the text; otherwise, it is not clear where 97.24% is coming from.

Authors' response to comment 4:

Corrected. Thanks.

 

  5. TSNE is usually abbreviated as t-SNE in the literature. Please correct. Also, why do you have References 57 and 58 for t-SNE? The reference for t-SNE should go back to the previous works, such as van der Maaten, L.J.P.; Hinton, G.E. (Nov 2008). "Visualizing Data Using t-SNE". Journal of Machine Learning Research. 9: 2579–2605.

Authors' response to comment 5:

The abbreviation is corrected, and the reference is added. Thanks.

 

  6. The English of the newly added sentences (highlighted in the latest submission) and other parts of the paper needs attention. The letter cases still need attention throughout the paper as well. See, for instance, Twitter in Section 4, the second sentence starting with "and" in the second paragraph under Section 4, the caption of Table 6 ("each model's optimization" should be "the optimization of each model"), and there are more to be fixed.

Authors' response to comment 6:

Thanks, all the required changes are applied. Also, the English of the paper is revised and corrected by a native English speaker.

 

  7. Please use the correct letter casing. If there is an abbreviation, the words should be capitalized. For instance, on Page 13, "next sentence prediction(NSP)" should be "Next Sentence Prediction (NSP)", with a space after Prediction. Please check all abbreviations in the text.

Authors' response to comment 7

Thank you very much. The changes are applied.

 

  8. Please make sure there is a space between a word and a parenthesis throughout the text; see Page 13, for instance.

Authors' response to comment 8:

All are checked and corrected, thanks.

 

  9. Please do not use vague phrases such as "so on"; use specific terms or terminate the sentence (see Page 18).

Authors' response to comment 9:

All are removed. Thanks.

 

  10. The caption of Table 7 was not fixed after my last comment. It needs to have DistilBERT added to it too.

Authors' response to comment 10:

Corrected, thanks.

 

Once again, thank you very much for your thoughtful comments and guidance. We are glad that you reviewed our paper, and we believe that your advice helped us improve the article very much. We hope that you find the changes satisfactory.

 

Kind regards

 

Author Response File: Author Response.pdf
