Analyzing Social Media Data using Sentiment Mining and Bi-gram Analysis for the Recommendation of YouTube Videos

: In this work we combine sentiment analysis with graph theory to analyze user posts,


Introduction
Recommender systems (RS) are intended to provide the online user with advice, reviews and opinions from previous purchasers on products and services, mainly through methods such as collaborative filtering (CF) [1]. The main objective of an RS using CF is to persuade users to buy items or services they have not previously bought or seen, based on the buying patterns of others. This can be achieved by ranking either the item-to-item similarity or the user-to-user similarity and then predicting the top scoring product that ought to appeal to the potential buyer. Unfortunately, CF has a number of limitations, such as the cold-start problem, i.e. generating reliable recommendations for users or items with few ratings. However, this issue can be alleviated to some extent by reusing pre-trained deep learning models and/or using contextual information [2]. Since CF is generally an open process, it can be vulnerable to biased or fake information [3,4]. Fake user profiles can easily manipulate recommendation results by giving the highest ratings to targeted items while rating other items much as regular profiles do. This behavior is called a "shilling attack" [5].

Initially launched in 2005, YouTube has seen exponential growth in submitted videos and is the most popular platform for viewing material that informs, educates and entertains its users. YouTube is a free video sharing service allowing users to view online videos and also to develop and upload their own materials to share with others [6,7]. However, for many YouTube contributors the opportunity to earn money from their channel's popularity is a great incentive. To earn money from YouTube, a contributor must have 1,000 subscribers and at least 4,000 watch hours in the past year. Contributors can then apply to YouTube's Partner Program and monetize their channel.
However, YouTube keeps careful surveillance on any mechanism that artificially inflates the number of comments, views or likes. Unscrupulous contributors often achieve increased rankings by using bots or automatic systems, or even by presenting videos to unsuspecting viewers.

The objective of our work is to demonstrate that a recommendation engine can be used to provide users with reliable YouTube videos based on initial keyword searches. The topic of interest is global warming/climate change, but the system could be applied to any subject. The objectives are two-fold. First, once we can identify users' sentiment/opinions on global warming, we can provide them with authoritative videos with scientific credence based on their beliefs. Second, we can present users with authoritative videos representing the opposite stance. The intent is to balance out the debate with evidence they would perhaps not necessarily seek out. Our intention is not to change opinions but to help users become more aware of the issues.

To achieve these objectives, we combine sentiment analysis and graph theory to provide deeper insights into YouTube recommendations. Rather than use different software platforms, we combine several R libraries into a unified system, making overall integration easier. The overall system workflow is shown in Fig 1. An initial search topic is defined and fed into the APIs of the three platforms (Twitter, Reddit and YouTube). The resulting posts are preprocessed and parsed; the text data is then analysed by graph theoretic measures that provide statistical metrics of user posts and how they interact. The sentiments of user posts are used to create topic maps which reflect the common themes and ideas these users have. Ratings of YouTube videos and the provenance of their sources are estimated to provide some indication of their validity and integrity.

The main contribution of this work is threefold: first, we integrate sentiment mining with graph theory, providing statistical information on the posters and contributors; second, we use up-votes and down-votes as a recommendation source; finally, we create a logical structuring of the Twitter, YouTube and Reddit data using topic maps. Topic modelling is necessary since most topics of interest comprise a mixture of words and sentiments, which is a feature of human language. Some overlapping of concepts will therefore occur, so an unsupervised classification method is required. We use Latent Dirichlet Allocation (LDA), which is commonly used for fitting topic models [8].

The remainder of this paper is structured as follows: section two describes related work and recent advances in recommender systems, section three outlines the social media data used, and section four describes the computational methods used. Section five presents the experimental results and the discussion, and finally section six presents the conclusions and future work.

Here we discuss related work for recommender systems, sentiment analysis and graph theoretic methods.

2.1. Recommender systems
Recommender systems can be categorized into three main groups: content-based recommender systems, collaborative recommender systems and hybrid recommender systems. One of the first and most predominant is the Amazon recommendation system, which has undergone many refinements over the past 20 years [9]. RS are generally trained from historical data and provide the customer with potentially useful feedback on products or services they may like. The details of the RS algorithm used by YouTube are unknown, but it is generally believed to employ deep neural learning [10]. However, a recent study revealed it to contain biases and to be a major source of misinformation on certain health related videos [11]. Another issue, which we do not tackle in this paper, is attacks on recommender systems that either down vote or up vote content [12].

Our system can be classed as a hybrid. Similar work to ours includes Kim and Shim, who proposed a recommender system based on Latent Dirichlet Allocation (LDA) using probabilistic modelling for Twitter [13]. The top-K tweets for a user to read, along with the top-K users that should be followed, are identified based on LDA. The Expectation-Maximization (EM) algorithm was used to learn model parameters. Abolghasemi investigated the issues around human personality in decision-making, as it plays a role when individuals hold discussions to reach a group decision on which movie to watch [14]. They devised a three-stage approach to decision making: they used binary matrix factorization methods in conjunction with an influence graph that includes assertiveness and cooperativeness as personality traits, and then applied opinion mining to reach a common goal. We use similar metrics to judge personalities based on the tenor/tone of the language used and on users' likes and dislikes.

A similar approach was taken by Leng et al, who were researching social influence and interest evolution for group recommendations [15]. The system they developed (DASIIE) is designed to dynamically aggregate social influence diffusion and interest evolution learning, and uses Graph Neural Networks as the basis of its recommendations. The neural network approach allowed them to integrate the group members' role weights and expertise weights, enabling the decision-making process to be modeled simultaneously. Wu et al. have examined the technique of data fusion for increasing the efficiency of item recommender systems. Their approach employed a hybrid linear combination model and used a collaborative tagging system [16].

Over the past 10 years or so sentiment analysis has seen massive expansion, both in practical applications and in research theory [17][18][19]. The process of sentiment mining involves the preprocessing of text using either simple text analytics or more complex NLP such as the Stanford system [20]. The text data can be organised by individual words, or at the sentence and paragraph level by the positive or negative words it is comprised of [21]. Words are deemed to be either neutral, negative or positive based on the assessment of a lexicon [22,23]. Sentiment analysis is employed in many different areas, from finance [24,25] to mining student feedback in educational domains [26,27]. It has also been used to automatically create ontologies from text [28]. Sentiment analysis has been used to examine satisfaction within the computer gaming community, observing the features in games that players liked/disliked [29]. Commercial applications for the automated mining of customer emails, feedback and reviews, aimed at improving satisfaction with products or services, have seen tremendous growth [30,31]. Twitter is often used as a source of data for sentiment mining on many topics [32]; however, it is tweets collected over long time periods that tend to reveal interesting trends and patterns [33]. For example, sentiment analysis has been applied to monitoring mental health issues based on Tweets [34].

Work by Kavitha is similar to ours, as it considers YouTube user comments based on their relevance to the video content given by the description [35]. They build a classifier that analyses heavily liked and disliked videos; similarly, we use counts to help rate the videos. They also consider spam and malicious content; we likewise filter out posts that contain sarcastic and profane content, as they are unlikely to contain much cogent information [11]. A more serious issue was considered by Abul-fottouh in the search for bias in YouTube vaccination videos. They discovered that pro-vaccine videos (64.75%) outnumbered anti-vaccine videos (19.98%), with 15.27% of videos being neutral in sentiment. It is unsurprising that YouTube tended to recommend neutral and pro-vaccine videos more than anti-vaccine videos. This implies YouTube's recommender algorithm will recommend similar content to users with similar viewing habits and similar comments. This is related to the sentiment work of Alhabash, who investigated cyber-bullying on YouTube by examining the effects of comments, virality and arousal levels on civic behavioral patterns [36]. The findings concluded that people are more committed to, or interested in, topics or comments that have negative sentiments; hence cyber-bullying videos appear to have a disproportionate effect on users. Further work by Shiryaeva et al investigated the negative sentiment (anti-values) in YouTube videos; here the viewpoint was taken through the lens of linguistics to reveal grammar and style indicative of certain behaviors and intentions [37]. Although the work was not automated, the authors were able to identify 12 anti-values that were characteristic of bad behavior.
Graph theory is an area of computer science that uses statistical measures to gather information about the connectivity patterns between nodes (which can be people, objects or communications), and can reveal useful insights into the dynamics, structure and relationships that may exist [38,39]. Numerous areas have benefited from graph theory, such as computational biology and especially social media, which has received a great deal of attention from researchers [40]. The most notorious incident was the Facebook/Cambridge Analytica scandal, which involved the misuse of personal data [41]. However, this particular case served to highlight the power of machine learning and interconnected data to influence individuals. In social media analysis, individuals are connected to friends, colleagues, and political, financial and personal web interests, all of which can be analyzed by organizations to improve services and products or to detect trends and opinions [42].

Graph theory was used by Cai to examine the in-degree of posters; the intention was to identify whether shilling attacks were occurring in user posts [43]. Each user was assigned a "suspicion" rating based on their in-degree and their behavior characteristics (diversity of interests, long-term memory of interest, and memory of rating preference). The graph information was fed to a density clustering method and malicious users were generally identified. A similar approach was taken by Cruickshank, who used a combination of graph theory and clustering on Twitter hashtags [44]. The method investigated the application of multiple different data types that can be used to describe how users interact with hashtags in the COVID-19 Twitter debate. They discovered that certain topical clusters of hashtags shifted over the course of the pandemic, while others were more persistent. The same effect (homophily) is likely to be true of the climate change debate. For example, the HarVis system of Ahmed uses graph theory to untangle frequent from infrequent posters to assist a better understanding of the ranking of authors/posters [45]. This is an important point, as it is best to weigh authoritative heads instead of just counting them.

The use of graph-like structures such as Graph Convolutional Networks (GCN) is becoming more popular; this approach has the flexibility and power to model many social media problems. GCNs are more powerful than standard graph theoretic methods but come with a computational burden and a requirement for more data. The use of GCN is also receiving attention for identifying shilling attacks in recommender systems [46]. Another issue is the informal language used in posts and other characteristics of this type of data; for example, Keramatfor et al understood that short posts such as Tweets have dependencies upon previous posts [47,48]. Modelling Tweet dependencies requires the combination of data such as textual similarity, hashtag usage, sentiment similarity and friends in common.

In table 1 we provide a short qualitative comparison with the recommendation systems most similar to ours. The difference is that our system uses a greater variety of social media data, and uses profiling and a wider variety of computational methods.

Here we describe our data sources and how they are pre-processed and integrated prior to building machine learning models and implementing the recommendation system. Twitter, Reddit and YouTube posts are searched based on climate change keywords, then downloaded using the appropriate APIs. The posts are cleaned by removing stopwords, punctuation and emojis, and the remaining words are stemmed. A separate corpus, consisting of a term-document matrix, is created for each data source. We then build topic maps for each corpus; the optimum number of topics is chosen from a range of 10 to 100 candidates by calculating the harmonic mean for each candidate number. We did not analyze the social media data to determine if any content was generated by bots. The social media companies are well aware of the issues and have developed bot detection software [50,51].

Reddit is a social news aggregation platform and discussion forum where users can post comments, web links, images, and videos. Other users can up/down vote these posts and engage in dialog; the site is well known for its open and diverse nature. User posts are organized by subject into specific boards called communities or subreddits. The communities are moderated by volunteers who set and enforce rules specific to a given community; they can remove posts and comments that are offensive or that break the rules, and they also keep discussions on topic [53][54][55]. Reddit is becoming very popular: the SemRush web traffic system estimates it to be the 6th most visited site in the USA [56]. We text mine Reddit for posts and sentiment pertaining to the issues surrounding the climate change debate [57][58][59][60][61]. The Reddit data was collected between December 2022 and February 2023 using the R interface RedditExtractoR [62]; extraction was constrained by the rate limits of the Reddit API.

The Reddit data consists of two structures, the comments and the threads. The comments data consists of the following variables: url, author, date, timestamp, score, upvotes, downvotes, golds, comment and comment-id. The threads data has further information pertaining to other users' actions on the posts, such as total-awards-received, golds, cross-posts and other user comments.

The three data sets from Twitter, YouTube and Reddit must now be preprocessed prior to sentiment analysis. In algorithm 1 we show the stages of processing the three data sources (Twitter T_txt, Reddit R_txt and YouTube Y_txt). In lines 1 to 4 each text is converted into a Corpus, and in line 5 they undergo removal of stop words, stemming, and removal of punctuation and non-ASCII text. Lines 6 to 11 create the topic maps for each Corpus, using a for loop to build a series of topic maps, from 10 to 100 maps. Lines 12 and 13 use the harmonic mean metric to judge the optimum number of maps for each Corpus. Finally, line 14 returns the optimum topic maps and related data structures.

We use the sentimentr package written by Rinker [65], which incorporates the lexicon developed by Ding et al [23]. The lexicon consists of words which have been rated as neutral, negative or positive.

Algorithm 1 Data transformation for text mining
Input: Raw text for Twitter T_txt, Reddit R_txt, YouTube Y_txt
Output: Corpus for Twitter C_t, Reddit C_r, YouTube C_y; topic maps for each Corpus TM_t, TM_r, TM_y; optimum number of topic maps OT_t, OT_r, OT_y
1: Initialize MinWordFreq ← 5
2: Create corpus C_t ← T_txt
3: Create corpus C_r ← R_txt
4: Create corpus C_y ← Y_txt
5: Preprocess C_t, C_r, C_y ← removal[stopwords, stemming, punctuation, non-ascii]
6: repeat
7: if (words >= MinWordFreq) then
8: Build TM in C = ∀C
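The corpus-building stage of algorithm 1 can be illustrated with a minimal sketch. It is written in Python rather than the R packages used in this work, the stop-word list is a deliberately tiny placeholder, stemming is omitted, and the function names `clean` and `term_document_matrix` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it", "on"}
MIN_WORD_FREQ = 5  # mirrors MinWordFreq in Algorithm 1


def clean(post):
    """Lower-case a post and drop non-ASCII text (emojis), punctuation and stop words."""
    ascii_only = post.encode("ascii", "ignore").decode("ascii")
    tokens = re.findall(r"[a-z]+", ascii_only.lower())
    return [t for t in tokens if t not in STOPWORDS]


def term_document_matrix(posts, min_freq=MIN_WORD_FREQ):
    """Return (terms, matrix) where matrix[i][j] counts term i in post j.

    Terms occurring fewer than min_freq times across all posts are dropped.
    """
    docs = [Counter(clean(p)) for p in posts]
    totals = Counter()
    for d in docs:
        totals.update(d)
    terms = sorted(t for t, n in totals.items() if n >= min_freq)
    return terms, [[d[t] for d in docs] for t in terms]
```

One such matrix would be built per platform, so that the Twitter, Reddit and YouTube corpora remain separate for the later topic modelling.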

Graph modelling
The igraph package developed by Csardi and Nepusz provides a comprehensive toolkit for graph analysis; it is available across several languages and is regularly updated and maintained [72]. It allows statistics to be computed from the graph network based on the nodes and connectivity patterns. Useful statistics include closeness, betweenness, and hubness, amongst others. Furthermore, it is possible to detect community structure, where certain nodes strongly interact and form cohesive clusters which may relate to some real-world characteristics of the network. Graph theoretic methods can be applied to any discipline where the entities of interest are linked together through various associations or relationships. Other graph approaches, different to ours, involve graph neural networks (GNNs), which are a powerful way of expressing graph data [73].

Hub nodes have many connections to other nodes and are therefore of some importance or influence; the deletion of a hub node is more likely to be catastrophic than the deletion of a non-hub node. This is a characteristic confirmed in many real-world networks, which are typically small world networks with power law degree (number of edges per vertex) distributions [38].

The concept of the shortest path is important to centrality measures. Two vertices i and j are connected if there exists a sequence of edges that links i and j. The length of a path is its number of edges, and the distance l(i, j) between i and j is the length of the shortest path connecting them [39]. In its standard form, the closeness centrality of a given node i in a network is the reciprocal of the sum of its distances to all other nodes:

$$C(i) = \frac{1}{\sum_{j \neq i} l(i, j)} \qquad (2)$$

Betweenness centrality (BC) is a measure of the degree of influence a given node has in facilitating communication between other node pairs, and is defined as the fraction of shortest paths going through a given node. If p(v_i, v_j) is the number of shortest paths from node i to node j, and p(v_i, v_k, v_j) is the number of these shortest paths that pass through node k in the network, then the BC of node k is given by:

$$BC(v_k) = \sum_{i \neq j \neq k} \frac{p(v_i, v_k, v_j)}{p(v_i, v_j)} \qquad (3)$$
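As an illustration of the two centrality measures, the following sketch computes them on an unweighted, undirected graph stored as an adjacency dictionary. It is plain Python rather than the igraph package used in this work; the function names are hypothetical, closeness is taken as the reciprocal of the summed distances to reachable nodes, and betweenness counts each unordered node pair once (Brandes' accumulation, halved), which are assumptions about normalisation rather than a statement of the exact settings used here.

```python
from collections import deque


def bfs(adj, source):
    """Breadth-first search from source.

    Returns distances, shortest-path counts (sigma), predecessor lists
    and the order in which nodes were visited.
    """
    dist = {source: 0}
    sigma = {v: 0 for v in adj}
    sigma[source] = 1
    preds = {v: [] for v in adj}
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def closeness(adj, i):
    """Reciprocal of the summed distances from i to every reachable node."""
    dist, _, _, _ = bfs(adj, i)
    total = sum(dist.values())
    return 1.0 / total if total else 0.0


def betweenness(adj):
    """Fraction of shortest paths through each node, summed over node pairs."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        _, sigma, preds, order = bfs(adj, s)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was visited from both ends, so halve the totals.
    return {v: b / 2.0 for v, b in bc.items()}
```

On the three-node path a-b-c, node b lies on the only shortest path between a and c, so it has betweenness 1 while the end nodes have 0.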

Generating the topic models
Latent Dirichlet Allocation (LDA) is commonly used to generate topic models [74]. We use the R topicmodels package developed by Grün and Hornik [8,75]. Equation 4 defines the model; it contains three products, over K, M and N, which run over the topics, documents and terms respectively.

$$P(W, Z, \theta, \varphi; \alpha, \beta) = \prod_{i=1}^{K} P(\varphi_i; \beta) \prod_{j=1}^{M} P(\theta_j; \alpha) \prod_{t=1}^{N} P(Z_{j,t} \mid \theta_j) \, P(W_{j,t} \mid \varphi_{Z_{j,t}}) \qquad (4)$$

Where: $P(W, Z, \theta, \varphi; \alpha, \beta)$ is the overall probability of the LDA model; $\prod_{i=1}^{K} P(\varphi_i; \beta)$ is the Dirichlet distribution of the topics over the terms; $\prod_{j=1}^{M} P(\theta_j; \alpha)$ is the Dirichlet distribution of the documents over the topics; the probability of a topic appearing in a given document is given by $\prod_{t=1}^{N} P(Z_{j,t} \mid \theta_j)$; and the probability of a word appearing in a given topic is given by $P(W_{j,t} \mid \varphi_{Z_{j,t}})$. Here θ and φ hold the document-topic and topic-term distributions respectively; α and β are the Dirichlet distribution parameters; and the indices i, j and t keep track of the topics, documents and terms. The term $W_{j,t}$ is the t-th word of document j and $Z_{j,t}$ is the topic assigned to it [74].

We generate individual topic models for the Twitter data, Reddit data and YouTube data. The optimum number of topics k is determined using the harmonic mean method of Griffiths and Steyvers [76,77]. This is shown in equation 5.
Where: w represents the words in the corpus, and the model is specified by the number of topics K. Gibbs sampling provides values of p(w|z, K), and p(w|K) is estimated by taking the harmonic mean of a set of values of p(w|z, K) when z is sampled from the posterior p(z|w, K). Here n_k^w is the number of times word w has been assigned to topic k in the vector z, and Γ is the standard Gamma function.
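The harmonic mean estimate itself is simple to compute once the Gibbs sampler has produced log-likelihood values log p(w|z, K) for a set of samples. The sketch below is in Python rather than R and works in log space to avoid numerical underflow; the function names and the dictionary layout mapping each candidate K to its sampled log-likelihoods are assumptions made for illustration.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logarithms.

    HM = S / sum_s exp(-ll_s), evaluated with the log-sum-exp trick so
    that very negative log-likelihoods do not underflow to zero.
    """
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(len(neg)) - log_sum


def best_k(samples_by_k):
    """Pick the number of topics whose samples give the largest estimate."""
    return max(samples_by_k, key=lambda k: log_harmonic_mean(samples_by_k[k]))
```

Because the harmonic mean is dominated by the smallest sampled likelihood, a single poor sample can shift the estimate noticeably, which is one source of the instability associated with this method.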

The last component in our system is the RS engine, which contains the information from the sentiment analysis, the statistics from user ratings and the user connectivity patterns from graph analysis. We use nonnegative matrix factorization (NMF) to perform collaborative filtering (CF) [78,79]. NMF is related to Principal Components Analysis (PCA), which is typically used for dimensionality reduction while still keeping a meaningful representation of the solution [80]. Both methods use similar matrix transforms that are linear combinations of the other variables, but NMF has the stricter constraint that the values should not be negative. This is an advantage because it enables a clearer interpretation of the factors involved, since in many applications negative values would be counterintuitive, such as negative website visits or negative human height. There are also improvements in sparsity for feature detection and in the imputation of missing information. We integrate our recommender system within the framework of the R package by Hahsler, which allows easier testing and comparison [81].

The objective is to determine a matrix of ratings V, where the rows represent the users and the columns represent the videos. NMF approximates this matrix as the product of a matrix of users W and a matrix of videos H. The majority of the entries of V are unknown; these can be predicted by NMF using W x H ≈ V (see fig 6).

We use nonnegative matrices W and H of rank k, from which V is approximated by their matrix product. Here k is a parameter that is usually set smaller than the number of rows and columns of V. The choice of k is a fine balance: it must be large enough to capture the key features of the data but small enough to avoid overfitting.
In previous work we modified NMF as a data integration method [82]; other variations are typically used for data integration with heterogeneous data, especially in chemistry and community detection [83,84].
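To make the factorization concrete, the following is a minimal sketch of NMF on a sparse rating matrix, using Lee and Seung style multiplicative updates restricted to the known entries. It is plain Python rather than the R framework used in this work; the function names, the use of None for unknown ratings, and the defaults for the number of iterations and the random seed are all illustrative assumptions.

```python
import random


def predict(W, H):
    """Matrix product W x H, giving a predicted rating for every cell."""
    cols = range(len(H[0]))
    return [[sum(w * H[a][j] for a, w in enumerate(row)) for j in cols] for row in W]


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V (rows = users, columns = videos, None = unknown) by W x H.

    The multiplicative updates only ever rescale entries by non-negative
    factors, so W and H remain non-negative throughout.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() for _ in range(k)] for _ in range(n)]
    H = [[rng.random() for _ in range(m)] for _ in range(k)]
    known = [(i, j) for i in range(n) for j in range(m) if V[i][j] is not None]

    for _ in range(iters):
        # Update W with H fixed, summing over the known cells only.
        P = predict(W, H)
        num = [[0.0] * k for _ in range(n)]
        den = [[0.0] * k for _ in range(n)]
        for i, j in known:
            for a in range(k):
                num[i][a] += V[i][j] * H[a][j]
                den[i][a] += P[i][j] * H[a][j]
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]

        # Update H with W fixed, again over the known cells only.
        P = predict(W, H)
        num = [[0.0] * m for _ in range(k)]
        den = [[0.0] * m for _ in range(k)]
        for i, j in known:
            for a in range(k):
                num[a][j] += W[i][a] * V[i][j]
                den[a][j] += W[i][a] * P[i][j]
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
    return W, H
```

After fitting, `predict(W, H)` fills in every cell of the matrix, including those that were None, and these filled-in values are the candidate recommendations.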

The flow of data and processing of the posts begins with the conversion of raw text from Twitter, Reddit and YouTube into Corpora, essentially term-document matrices (as described in algorithm 1). Once they are processed, we can extract topic models from the Corpora to aid our understanding of the posts through this logical grouping of keywords.

An example of sentiment analysis on Twitter posts, using the R package sentimentr, is shown in fig 7. The posts are identified by number (1-10) and are ranked as either positive (green), neutral (grey) or negative (red), each with a number denoting the strength of the sentiment. There are three word sentiment lookups available, the Bing, NRC and Afinn dictionaries, each with a differing number of rated words and with differing sentiment values attached to each word. Sentiment can be assessed at the word level, the sentence level or for the entire post (paragraph). As can be seen, the Twitter data shown here represents a number of opinions on the climate change debate.

We can see that comment 1 is rated at zero sentiment since the sentence is fairly neutral in its wording. In comment 2 we find the first sentence is neutral, but the second sentence has a positive sentiment word (optimistic) and is rated +.082. Comment 3 is more negative because of the words scam, scammer and hoax, rated at −147.
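The idea of lexicon-based, sentence-level scoring can be sketched as follows. This is a simplified Python illustration and not the algorithm implemented by sentimentr: the four-word lexicon, the negator list, the square-root length normalisation and the function names are all assumptions chosen to keep the example small.

```python
import re

# Hypothetical miniature lexicon; real lexicons hold thousands of rated words.
LEXICON = {"optimistic": 1.0, "good": 1.0, "scam": -1.0, "hoax": -1.0}
NEGATORS = {"not", "no", "never"}


def sentence_scores(post):
    """Score each sentence as (sum of word polarities) / sqrt(word count)."""
    scores = []
    for sentence in re.split(r"(?<=[.!?])\s+", post.strip()):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        total = 0.0
        for i, w in enumerate(words):
            polarity = LEXICON.get(w, 0.0)
            if i > 0 and words[i - 1] in NEGATORS:
                polarity = -polarity  # flip the polarity after a negator
            total += polarity
        scores.append(total / len(words) ** 0.5)
    return scores


def label(score):
    """Map a numeric score to the three classes used in the text."""
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

A post with a neutral first sentence and a positive second sentence therefore yields two scores, which is why sentence-level counts can exceed the number of posts.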

The next stage is to develop topic models holding keywords that are coherently related to key concepts, which will then be data mined for bigrams. The optimum number of topic models for each corpus is determined using the harmonic mean described in equation 5. In fig 8 the optimum numbers are presented, with 24 for Twitter, 44 for Reddit and 31 for YouTube concepts and issues. We used LDA to generate the topicmaps with a value starting at 10 and rising to 100 possible topic maps, so at the first iteration 10 topic maps would be selected to describe the corpus, then 11, 12, 13 and so on until 100 topic maps are generated. Beyond a certain point adding more topic maps simply degrades performance; the number of maps at which the harmonic mean stops increasing is the number to use. The harmonic mean method has known instabilities but is generally robust enough.

In fig 9 five of the 24 Twitter topicmaps are shown; the terms climate and change are present throughout some of the 24 topicmaps. Topicmap 2 is generally related to energy consumption of fossil fuels such as oil and gas. Topicmap 3 is concerned with public health and net zero. Topicmap 4 has gathered words on environmental impact and statements issued by the Intergovernmental Panel on Climate Change (IPCC). Topicmap 5 seems to have grouped human rights and social justice as key themes.

To augment the statistics and text mining, we also generated wordclouds, which perhaps give a better visualization and understanding of the main themes that dominate user posts. Individual word frequencies are used to highlight the important themes: the more frequent a word, the larger it is drawn. In fig 10 wordclouds for Twitter, Reddit and YouTube are presented. Clearly climate and change totally dominate user posts for Twitter, while Reddit and YouTube have a wider range of concepts with more or less equal frequency of occurrence. Only words that appear with at least five occurrences are displayed.

The next stage is to build graph theoretic models of bi-grams of co-occurring words, building up a picture of the sentiment relating to each YouTube video. Graph models of Twitter and Reddit are also constructed to support the ratings/rankings of the videos in terms of the esteem/trust in which the video producers are held. Table 3 presents graph theoretic statistics on the YouTube bigrams, including Hubness, which indicates for each word its relative connectivity importance. The other columns have identical values: the mod (modularity) column refers to the structure of the graph and can take a range of 0.0 to 1.0, indicating there is structure and not a random collection of connections between the nodes. Nedges indicates the number of connections in this small network, and nverts is the number of nodes in the network. The transit column refers to the transitivity or community strength; it is the probability that adjacent nodes in the network are interconnected.

As the graph is highly disconnected (bigrams linking to other bigrams), the transitivity is zero for all entries. Degree refers to the average number of connections per node and is around 2.0; diam (diameter) is the length of the shortest path between the most distant nodes. Connect indicates whether the graph is fully connected, and in this case it is not. Closeness of a node measures its average distance to all other nodes; high closeness scores suggest short distances to all other nodes. Betweenness detects the influence a given node has over the flow of information in a graph. The Density represents the ratio between the edges present in a graph and the maximum number of edges that the graph can contain. The Hubness is a value that indicates those nodes with a larger number of connections than an average node.
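The construction of a bigram graph and a few of the statistics described above can be sketched in a short example. It is plain Python rather than the igraph package used in this work; the tokenisation, the treatment of bigrams as undirected edges, the `min_count` filter and the function names are illustrative assumptions.

```python
import re
from collections import Counter


def bigram_graph(posts, min_count=1):
    """Count adjacent word pairs and keep those seen at least min_count times."""
    counts = Counter()
    for post in posts:
        words = re.findall(r"[a-z]+", post.lower())
        counts.update(zip(words, words[1:]))
    # Treat bigrams as undirected edges between distinct words.
    edges = Counter()
    for (a, b), n in counts.items():
        if a != b:
            edges[tuple(sorted((a, b)))] += n
    return {e: n for e, n in edges.items() if n >= min_count}


def graph_stats(edges):
    """Number of nodes and edges, density and mean degree of the graph."""
    nodes = {w for e in edges for w in e}
    nverts, nedges = len(nodes), len(edges)
    density = 2 * nedges / (nverts * (nverts - 1)) if nverts > 1 else 0.0
    mean_degree = 2 * nedges / nverts if nverts else 0.0
    return {"nverts": nverts, "nedges": nedges, "density": density, "degree": mean_degree}
```

Raising `min_count` plays the same role as a co-occurrence threshold on a bigram plot: it prunes rare pairs so that only the dominant word associations remain.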

In table 4 we show the basic statistics of several YouTube videos. We collect data such as the ID of the video; e.g. in the first row, oJAbATJCugs would normally be used to select the video in a web browser using the string "https://www.youtube.com/watch?v=oJAbATJCugs". The number of comments received for each video is collected, along with the average number of likes. We also collect the number of comments that had zero likes; note that this does not imply the video was disliked, only that the person sending the comment neglected to select like, irrespective of their feelings for the video. The number of unique posters making comments is also recorded. Next we perform sentiment analysis, examining the overall sentiment for the video and then breaking down comments into neutral, negative and positive sentiments. The total numbers of sentiments (positive, negative and neutral) are based on the sentence level, and therefore exceed the overall number of comments, i.e. the number of posts.

Figure 10. Wordclouds: (a) Twitter, (b) Reddit, (c) YouTube.

Table 3. Graph theoretic statistics on YouTube bi-graph/bigrams on five users.

In fig 11, the YouTube bigrams are displayed; we only show those words that have at least 100 co-occurrences based on key topic map groupings. The bigrams for YouTube are more strongly linked to coherent topics and follow a logical pattern of subjects, with more linkages between bigrams. Generally, the comments on YouTube are more calm and balanced, with some thought given to the subject of global warming.

Figure 11. Bi-gram chart of YouTube linked pairs of words.
In fig 12, the situation for Twitter bigrams is more complicated; furthermore, because of the large number of posts, we require at least 200 word co-occurrences before a bigram can appear on the plot. However, it provides a richer source of data, illuminating the issues and concerns once the most frequently occurring words are revealed based on key topic map groupings. Twitter posts generally seem to contain a lot of off-topic issues such as legal aspects and gun violence. Another issue is the text limit on tweets (280 characters), which may constrain posters in their dialog. The text limit has been raised to 4,000 characters for fee paying subscribers.

Having gathered statistics from sentiment analysis of the topic maps, comments and bigrams of paired common words, we now structure the data to build the recommendation engine. The difficulty we face is that the matrix of items (videos) and users is very sparse; this is alleviated to a certain extent by generic profiling of users from Twitter and Reddit data.

The rating matrix is formed from the YouTube user rankings of videos; these are normalized by centering, i.e. subtracting the row mean from all ratings in the row, to remove possible rating bias.

In operation the recommender system makes suggestions for selected users of YouTube based on their ratings of previous videos, their comments (if applicable) and related statistics. In table 8 we highlight 20 suggestions based on 5 users selected at random. Each user may obtain a differing number of recommendations. Column one identifies the video, column two gives the user ID (selected at random), column three gives the YouTube video ID (which can be pasted into a browser), column four gives the title of the video, column five gives the number of views and finally column six gives the recommender score. Where the videos stand in relation to climate change is obvious from the titles, with the exception of video 20, which appears to take a neutral stance. The score or ranking of a video is a value between 0.0 and 1.0, formed from the statistics generated and YouTube recommendations. Experimentally we have determined that values below 0.5 are unlikely to be of interest, as the videos we found below this value were off-topic and had little relation to global warming.
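The row centering and the score cut-off described above amount to two small operations, sketched here in Python for illustration. The function names, the use of None for missing ratings and the representation of recommendations as (video ID, score) pairs are assumptions; only the row-mean subtraction and the 0.5 threshold come from the text.

```python
def center_rows(ratings):
    """Subtract each user's mean rating from that user's known ratings."""
    centered = []
    for row in ratings:
        known = [r for r in row if r is not None]
        mean = sum(known) / len(known) if known else 0.0
        centered.append([None if r is None else r - mean for r in row])
    return centered


def keep_relevant(recommendations, threshold=0.5):
    """Drop (video_id, score) pairs scoring below the threshold; best first."""
    kept = [(v, s) for v, s in recommendations if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Centering makes a habitually generous rater and a habitually harsh rater comparable, since each rating is then expressed relative to that user's own average.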

In this paper, we constructed a recommendation system based on sentiment analysis of topic maps, bigrams and graph analysis. The main source of data was the posts, comments and rating statistics attached to each YouTube video. From this data we were able to profile those agreeing with the global warming situation and those who were more skeptical. Although our model is successful in certain conditions, it has major limitations; mainly, we cannot usually identify posters from one forum to another. Posters typically have different user-names, so we would be unlikely to be able to extract further information; hence we opted for generic person profiling. We tried to alleviate that drawback by attempting to judge the character, sentiment and beliefs of the users. Future work must deal with improving user profiling based on their sentiment, type of language they use