Sentiment Analysis of Social Survey Data for Local City Councils

Big data analytics can be used by smart cities to improve their citizens’ liveability, health, and wellbeing. Social surveys and also social media can be employed to engage with their communities, and these can require sophisticated analysis techniques. This research was focused on carrying out a sentiment analysis from social surveys. Data analysis techniques using RStudio and Python were applied to several open-source datasets, which included the 2018 Social Indicators Survey dataset published by the City of Melbourne (CoM) and the Casey Next short survey 2016 dataset published by the City of Casey (CoC). The qualitative nature of the CoC dataset responses could produce rich insights using sentiment analysis, unlike the quantitative CoM dataset. RStudio analysis created word cloud visualizations and bar charts for sentiment values. These were then used to inform social media analysis via the Twitter application programming interface. The R codes were all integrated within a Shiny application to create a set of user-friendly interactive web apps that generate sentiment analysis both from the historic survey data and more immediately from the Twitter feeds. The web apps were embedded within a website that provides a customisable solution to estimate sentiment for key issues. Global sentiment was also compared between the social media approach and the 2016 survey dataset analysis and showed some correlation, although there are caveats on the use of social media for sentiment analysis. Further refinement of the methodology is required to improve the social media app and to calibrate it against analysis of recent survey data.


Introduction
Many government agencies are moving towards a data-driven business strategy so that they can exploit the benefits from analysing the masses of big data they have accumulated over time in order to evolve into smart communities [1]. The large volume of big data and the rapid velocity at which it is collected can be challenging for analysts to handle. Public administrators need to consider whether their organisations have the capability to handle vast volumes of structured and unstructured data, perform data analysis to generate actionable insights, and derive meaning from the data to support evidence-based decisions [2].
The term 'smart city' can be defined as a 'place where traditional networks and services are made more flexible, efficient, and sustainable with the use of information, digital, and telecommunication technologies, to improve its operations for the benefit of its inhabitants' [3]. These authors also state that emerging technologies such as the Internet of Things (IoT) and big data are interrelated and contribute to the progression of smart cities by increasing efficiencies and responsiveness. The application of insights

Related Work
Cities and councils use social surveys to engage with their citizens, and these surveys can be voluminous requiring big data analysis techniques. Machine learning is beginning to be applied to such social surveys. Buskirk et al. provided an introduction to the potential of machine learning (ML) techniques for survey research [5]. Ramirez et al. applied ML to public health surveys to explore the use of language for English and non-English responses and showed that there are differences between responses in different languages with heterogeneity among the Asian languages [6].
Another approach is to use sentiment analysis, an application of natural language processing (NLP) that identifies expressions reflecting opinions towards issues [7]. Opinions are generally categorised as polar values of positive, negative, or neutral and sentiment analysis determines their polarity and strength. This automatically extracts useful information from sources such as posts on blogs or social media. Such analysis can be conducted at the phrase, sentence, and document level [8]. This is particularly relevant for text-based surveys such as the Casey Next survey. Algorithms have been developed to extract sentiment from text responses to survey questions [9,10]. Lexicon-based approaches compare words in the text to those in a lexicon that contains positive and negative words and their associated intensities [11]. Sentiment is then determined from the matches between the text and the lexicon.
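The lexicon-based approach described above can be illustrated with a minimal sketch. The lexicon below is a hypothetical toy example (the words and intensities are assumptions chosen for illustration, not taken from any published lexicon), and the three-way polarity rule is one simple way of mapping a score to a polar value.

```python
import re

# Hypothetical toy lexicon: word -> intensity (positive > 0, negative < 0).
# Published lexicons are far larger; these values are illustrative only.
TOY_LEXICON = {
    "safe": 0.75,
    "clean": 0.5,
    "friendly": 0.5,
    "crime": -0.75,
    "congested": -0.5,
}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the intensities of all tokens that match the lexicon."""
    return sum(lexicon.get(token, 0.0) for token in tokenize(text))


def polarity(score):
    """Map a numeric score to a polar value."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for sentence in ["A safe and clean city", "Too much crime", "A city"]:
        value = lexicon_score(sentence)
        print(sentence, value, polarity(value))
```

Words that do not appear in the lexicon contribute nothing to the score, so the strength of the result depends entirely on lexicon coverage.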
Social media provide a ready source of big data that can be mined and analysed [12]. It is estimated that there may be over 3 billion active social media users by 2023, a third of the world's population, with an estimated 800 million from China and 450 million from India [13]. Opinion mining from social media is employed by the media giants for research and marketing purposes using their own tools [14].
Data from social media such as Twitter can also be used to determine more immediate, although less formal, responses [15]. For example, Yigitcanlar et al. used geolocated Twitter analysis to study perceptions about smart city concepts and technologies in Australia [16]. Attitudes towards the COVID-19 pandemic have also been studied using Twitter [17,18]. These authors used word clouds to present key words and phrases indicative of public sentiment on the emerging pandemic. A similar approach was adopted by Kankanamge et al. to assess disaster severity from flooding using Twitter feeds [19]. These authors found that the analysis could track disaster severity fluctuations over time and demarcate highly impacted disaster zones through message geolocation.
Twitter was selected since it is a rapidly expanding social media microblogging service, provides a free API for practitioners, and can be considered as open data compared to other social media such as Facebook and Instagram that are more restrictive [16]. These other social media services use a mix of media formats including text, audio, imagery, and video, whereas Twitter is primarily text-based enabling NLP techniques such as sentiment analysis to be applied. Furthermore, it is also easier to extract keywords from tweets than Facebook comments, most likely because of the use of hashtags, mentions, and emoticons [14].

Methodology
The general methodology followed in this paper can be summarised as:

1. Identify data sources that can be used to provide social survey data;
2. Determine and trial analysis tools on several different datasets;
3. Develop a suite of apps to analyse the datasets for sentiment;
4. Further develop an app that can access social media feeds to determine current sentiment using findings from the historic dataset analysis;
5. Host the apps on a website for customer access;
6. Compare historic sentiment results with current sentiment where possible, and thus demonstrate that this approach can provide useful results for a local city council.

Council Datasets from Social Surveys
Suitable datasets to meet the project objectives were obtained from the Australian Government's open data portal [20]. There were few such datasets that met the requirements; however, several were identified. These comprised:

• The CoM social indicators survey that was conducted in 2018 and involved over 1200 residents. The dataset was used to measure outcomes for social indicators such as health, wellbeing, community sense, and connectedness of its citizens. The responses are quantitative in nature and thus not suitable for NLP analysis. The dataset is available online from the CoM data portal [21].
• The Casey Next short survey that was conducted in 2016, for which a contractor report was produced [22]. Over 3600 responses were collected as a combination of structured and unstructured data records that are mostly qualitative in nature. The dataset is available online from the Australian open data portal [20].
The CoM refers to the Central Business District of the City of Melbourne in the state of Victoria, Australia, while the CoC is a large, rapidly growing municipality in southern Victoria [23] with a population predicted to increase to over 500,000 people by 2041. The CoC has a significant 'smart city' program to improve liveability outcomes arising from lack of infrastructure, lack of employment opportunities, and other socio-economic issues. The CoM dataset was analysed first. However, this dataset contains mainly quantitative numerical data, and it was found that it could not be used to produce a sentiment analysis.
It should be noted that the CoM and CoC datasets are less than 1 MB in size and, thus, not strictly big data. However, the Twitter archive analysed in the later part of the project falls into this category.

Analysis and Web Hosting Tools
For the purpose of conducting data analysis for this project, the RStudio integrated development environment (IDE) tool was selected for the following reasons: (1) RStudio is an open-source freeware system [24], which suited the requirement that the project had no allocated budget, (2) the developers assigned to this project had previous extensive experience in its use, and (3) Python scripting is also supported by RStudio. Python libraries were also used in standalone codes for analysis of the CoM dataset including Pandas, NumPy, and Scikit-learn with Matplotlib and Seaborn used for graphical output [25].
A requirement for the Twitter API solution was to obtain access to real-time data from Twitter. The rtweet R library was embedded into R code running within RStudio [26]. This library has multiple functions that can be used to query social media posts based on multiple filters and settings. Findings from the RStudio data analysis on the historic 2016 dataset were used to inform keywords for searches of current Twitter data.
The web apps were developed using R code and integrated within a Shiny package [27], and then hosted on a website using Hostinger [28]. These systems provided free service for the duration of the student project, although with limited access and security. The Shiny free plan only allowed five apps to be hosted on a single account and 25 active hours use per month [29].

Analysis of Data
The analysis was conducted in three parts. In the first part, the CoM dataset was analysed. In the second part, the Casey Next survey dataset was analysed using RStudio to extract sentiment values for key issues in 2016. In the third part, these findings were used to inform Twitter analysis for more current sentiment analysis. The latter two parts required the development of web apps to perform this analysis.

Preliminary Analysis of City of Melbourne Social Indicators Dataset
The CoM dataset was analysed first. However, this dataset contains mainly quantitative data, and it was determined that it could not be used to produce a sentiment analysis, which needs less structured textual responses. Therefore, the dataset was not used other than for initial analysis to determine the direction of the project. Some sample analysis is shown in Figures 1 and 2. Analysis of this dataset was performed using Python libraries that were readily applicable to the quantitative responses.
Figure 1 shows the responses for the physical activity topics. There are 18 different cohorts that include residential area, gender, and age profile. There are six questions for each of the 18 cohorts:
A. Participate in adequate physical activity.
B. Participate in sports and exercise activities.
C. Participate in sports and exercise activities in the CoM.
D. Participate in organised physical activity.
E. Participate in physical activity organised by a fitness, leisure or indoor sports centre.
F. Participate in physical activity organised by a sports club or association.
This analysis shows that CoM residents generally engage in physical activity, although there are wide fluctuations among the cohorts. Not surprisingly, the 18-24 year cohort is the most active, whereas the 65+ year cohort is the least active. Figure 2 shows the responses to the indigenous cultural awareness question. This indicates that most are unaware of the two indigenous tribes (Wurundjeri and Boonwurrung) or at least cannot name them both. Indeed, the Wurundjeri tribe was historically more populous and has a far higher profile in the modern city than the Boonwurrung.

Analysis of Casey Next Dataset
An app was created to measure sentiment on the four questions in the Casey Next dataset and to display the results in a dynamic visualization. The app was developed using RStudio with R libraries for sentiment analysis. These questions are:

1. 'What Kind of Place Would You Like Casey To Be In 2041?'
2. 'If You Could Change One Thing In Casey What Would It Be?'
3. 'Describe Your Vision For Casey In Three Words?'
4. 'What's Most Important To You?'
The app allows for sentiment scores of respondents to be graphed against postal code, ward, age, gender, and suburb. Results can also be filtered against specific search words such as 'environment', 'health', or 'safety' so that the user can gauge sentiment against topics such as wellbeing, crime, and affordability.
The high-level process is shown in Figure 3. The data first needed to be cleansed since there were columns with missing data. An analysis of the data columns that had over 5% of their values missing indicated that the missing values satisfied the 'missing completely at random' (MCAR) definition. MCAR can be defined as the missing values having no relationship with either the intended or the observed values [30]. The pairwise approach was then applied, which excludes missing values from the analysis only when the specific columns of data that contain them are being used, as this method can produce less biased results for MCAR data.
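The pairwise approach to missing values can be sketched as follows. This is a minimal illustration in plain Python rather than the R code used in the project; the record layout and column names (`age`, `ward`, `response`) are hypothetical.

```python
MISSING = (None, "")


def missing_fraction(records, column):
    """Fraction of records whose value in `column` is missing."""
    if not records:
        return 0.0
    missing = sum(1 for record in records if record.get(column) in MISSING)
    return missing / len(records)


def pairwise_complete(records, columns):
    """Keep records that are complete for just the columns in use.

    A record with a missing value in some other column is retained,
    which is what distinguishes pairwise from listwise deletion.
    """
    return [
        record
        for record in records
        if all(record.get(column) not in MISSING for column in columns)
    ]


if __name__ == "__main__":
    # Hypothetical survey rows, for illustration only.
    rows = [
        {"age": "29-38", "ward": "A", "response": "safe"},
        {"age": None, "ward": "B", "response": "clean"},
        {"age": "39-48", "ward": "", "response": "roads"},
    ]
    print(missing_fraction(rows, "age"))
    print(len(pairwise_complete(rows, ["age", "response"])))
    print(len(pairwise_complete(rows, ["age", "ward", "response"])))
```

An analysis by age discards only the row with no age, whereas requiring every column to be complete (listwise deletion) would discard more rows than any single analysis needs.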
Sentiment analysis was conducted using the Syuzhet package [9,31]. The get_sentiment() function in Syuzhet was applied to determine the sentiment score for each user response to the examined data fields by finding all the lexicon words contained in each string and calculating the arithmetic sum of the component sentiment values [9]. This enables the cumulative sentiment scores to be plotted against a variety of parameters. The ggplot2 library was used to create the visualizations.
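The accumulation of per-response scores against a grouping parameter, with an optional search-word filter, can be sketched in Python as below. This is not the Syuzhet implementation: the lexicon values, the field names (`group`, `text`), and the group labels are all hypothetical and serve only to show the aggregation step.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon; the values are illustrative assumptions.
TOY_LEXICON = {"safe": 0.75, "clean": 0.5, "crime": -0.75}


def response_score(text, lexicon=TOY_LEXICON):
    """Arithmetic sum of the sentiment values of the lexicon words in the text."""
    return sum(lexicon.get(token, 0.0) for token in re.findall(r"[a-z']+", text.lower()))


def cumulative_by_group(responses, group_key, search_term=None):
    """Sum the response scores per group, optionally keeping only
    responses whose text contains `search_term`."""
    totals = defaultdict(float)
    for row in responses:
        text = row["text"]
        if search_term and search_term.lower() not in text.lower():
            continue
        totals[row[group_key]] += response_score(text)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical responses, for illustration only.
    rows = [
        {"group": "Suburb A", "text": "safe and clean"},
        {"group": "Suburb A", "text": "more crime"},
        {"group": "Suburb B", "text": "safe parks"},
    ]
    print(cumulative_by_group(rows, "group"))
    print(cumulative_by_group(rows, "group", search_term="safe"))
```

The resulting dictionary maps each group to its cumulative score, which is the form a bar chart of sentiment by ward, age, or suburb would consume.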
The user interface for the web app is shown in Figure 4. The user can select one of the survey questions, provide a search term and filter, and also modify the colour by selecting from a palette. A bar chart of sentiment values is then produced from the data using these input criteria.
The overall sentiment of Casey residents to the question 'What Kind of Place Would You Like Casey To Be In 2041?' is shown in Figure 5 as a percentage. This is overwhelmingly positive with nearly 50% reporting 'very positive' and less than 5% reporting negative (3.2%) or very negative responses (0.8%).
The top 25 most common words for the future vision for Casey are shown in Figure 6. This was determined using R code embedded within RStudio. The top five words are 'safe', 'clean', 'friendly', 'family-oriented', and 'community'.
An app for viewing word clouds was also created. A word cloud is defined as 'a computer visualization technique used in the text mining methods of documentation summarization' [32]; it is an attractive visual representation of textual data in which the importance of each word is indicated by font size or colour. Here, the R wordcloud2 package was used [33]. Sample word clouds are presented in Figure 7. In the word cloud app developed, the count of any word in the cloud can be displayed by hovering the mouse over the text.
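A word cloud is driven by a table of word frequencies. A minimal sketch of how such a table could be computed is given below; it is written in Python rather than with the wordcloud2 package, and the stop-word list is a small hypothetical placeholder.

```python
import re
from collections import Counter

# Small hypothetical stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "be", "with"}


def top_words(responses, n=25, stopwords=STOPWORDS):
    """Return the n most frequent non-stop-words across all responses."""
    counts = Counter()
    for text in responses:
        # Hyphens are kept so that terms such as 'family-oriented' stay whole.
        tokens = re.findall(r"[a-z'-]+", text.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical responses, for illustration only.
    sample = ["Safe and clean", "A safe, family-oriented place", "Safe and friendly"]
    for word, count in top_words(sample, n=5):
        print(word, count)
```

Each (word, count) pair corresponds to one entry in the cloud, with the count determining the font size.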
The word clouds in Figure 7 show the 25 words most commonly used for responses to the relevant questions. These word clouds can lead to actionable insights for the CoC. Figure 7a indicates that many Casey residents would like their city to be safe (694), clean (264), family-oriented (226), and friendly (234); while Figure 7b shows that the residents consider public transport (236), roads (254), and traffic (184) to be important issues. Here, the count for each keyword is included. Note that the Casey Next survey was conducted in 2016 pre-pandemic so 'safe' implies safe from crime or physical harm rather than safe from COVID-19 infection.
In Figure 7b the word 'better' is included in large font; this would have been frequently used as a qualifier for other characteristics. To further analyse this, a data dump of the 'If you had the power to change just one thing,' responses by the 29-38 and 39-48 age groups, filtered by the word 'better', showed that many of the responses containing this word also contain words such as 'transport', 'roads', 'traffic', 'infrastructure', and 'public transport'.
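The check on which words accompany a qualifier such as 'better' can be sketched as a simple co-occurrence count. The Python below is illustrative only; the stop-word list and the sample responses are hypothetical and are not drawn from the survey data.

```python
import re
from collections import Counter

# Small hypothetical stop-word list.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "more", "less"}


def cooccurring_words(responses, anchor, n=5, stopwords=STOPWORDS):
    """Count the words that appear in the same response as `anchor`.

    Each word is counted at most once per response.
    """
    anchor = anchor.lower()
    counts = Counter()
    for text in responses:
        tokens = set(re.findall(r"[a-z'-]+", text.lower()))
        if anchor in tokens:
            counts.update(tokens - {anchor} - set(stopwords))
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical responses, for illustration only.
    sample = [
        "Better roads",
        "better public transport",
        "more parks",
        "Better roads and less traffic",
    ]
    print(cooccurring_words(sample, "better", n=3))
```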
Word clouds were also generated for the other two survey questions. Question 1 ('What Kind of Place Would You Like Casey To Be In 2041?') highlighted 'safe', 'place', 'community', and 'friendly', while question 4 ('What's Most Important To You?') highlighted 'people', 'environment', 'connecting', 'transport', and 'improving'. These word clouds show that the CoC needs to take action to enhance its residents' safety and transport options, particularly with better traffic management and improved roads and public transport.
The two web apps described here were integrated within Shiny [27] and hosted on a website using Hostinger [28].

Analysis Using Social Media
The Casey Next survey analysis provided an initial understanding of the type of information the CoC was interested in. This analysis enabled the identification of keywords common among survey respondents. These keywords formed the basis for the Twitter investigation. A Twitter API was created and embedded within RStudio. Tweets can contain many different types of data, such as information relating to news, media, retweets, or replies to posts, and these can be formatted as text, imagery, video, or audio. Here, much of these data were filtered to target only tweets that did not contain media or news. Tweets were also targeted by geolocation to ensure they originated within the CoC. Tweets that contained links to other web pages were excluded from the analysis.
The RSentiment library was employed for analysis [34]. This has multiple functions that can be applied to determine the sentiment of sentences from Twitter. The tool assigns the data into six categories from very positive (1) to sarcastic (6). The sentiment values are presented as a bar chart.
Once the accuracy was verified and Twitter analysis optimised in RStudio for the CoC tweets, an interactive visualization was developed to enable a user to input search words into a web form. Again, Shiny enabled the translation of plots and other code into HTML format [27]. This allowed publication of the application and hosting online by embedding the code into a web page for clients to use to obtain a pulse check on the city, depending on the topic(s) of interest. Due to restrictions and limitations in terms of publishing findings from the Twitter data, the web app was embedded within a website hosted with Hostinger to secure and restrict access only to the approved audiences of the findings [28].
The Twitter API was customised to collect data from tweets posted within a 25 km radius of the centre of the CoC and for specific phrases containing keywords such as 'transport', 'COVID', and 'jobs'. Data from the past 6-9 days were used for the analysis. A high-level overview of the Twitter app development process is shown in Figure 8. A key requirement is data cleansing. The approach adopted here was to filter out much of the data that were not relevant to the analysis and to target posts that did not contain media or news. Original posts were used since the context could not be understood from replies, retweets, or news. Tweets that contained links to other web pages were also excluded since they could not be contextualised as relating to the city and often contained news or advertisements. This was achieved by filtering tweets that contained the strings 'news' or 'https'. To target tweets that commented on the CoC, the filter for geolocation code was employed to ensure that only tweets from the CoC were examined and analysed.
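The cleansing rules described above (original posts only, no 'news' or 'https' strings, a keyword match, and a 25 km radius) can be sketched as a single filter. This is a Python illustration and not the rtweet-based code used in the project: the tweet field names (`text`, `is_retweet`, `is_reply`, `lat`, `lon`) and the centre coordinates are hypothetical placeholders.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def keep_tweet(tweet, centre, radius_km=25.0, keywords=("transport", "covid", "jobs")):
    """Return True if the tweet passes all of the cleansing rules."""
    text = tweet["text"].lower()
    # Replies and retweets are dropped because their context cannot be recovered.
    if tweet.get("is_retweet") or tweet.get("is_reply"):
        return False
    # Tweets with links or news content are excluded.
    if "news" in text or "https" in text:
        return False
    # At least one search keyword must be present.
    if not any(keyword.lower() in text for keyword in keywords):
        return False
    # Only tweets geolocated within the radius of the centre are kept.
    return haversine_km(tweet["lat"], tweet["lon"], centre[0], centre[1]) <= radius_km


if __name__ == "__main__":
    centre = (-38.0, 145.3)  # placeholder coordinates, not the council's actual centre
    tweet = {"text": "Public transport is improving", "lat": -38.0, "lon": 145.3}
    print(keep_tweet(tweet, centre))
```

Matching on the substring 'news' is deliberately blunt and would also reject an otherwise relevant tweet that merely mentions the word, which is one reason the filter needs refinement.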
The RSentiment library was initially used for sentiment analysis. This tool categorises the tweet data into five ranges from Very Positive to Very Negative. The 'tidytext' library was used [35] with several other lexicons that are effectively databases of words that can be used to measure sentiment. However, only minor differences were found for sentiment values with the four different lexicons.
The web app interface is shown in Figure 9. The ggplot2 library was used to draw the bar charts. Note that the graphs below were redrawn from the raw output, such as that shown in Figure 9, so that the sentiment values range from Very Positive to Very Negative and are given as percentages. The web app plots the sentiment values in order of size.
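Redrawing the raw output as percentages in a fixed order from Very Positive to Very Negative can be sketched as follows. The three intermediate category names are an assumption, since only the two end points of the range are stated, and the sample labels are hypothetical.

```python
from collections import Counter

# End points are from the text; the intermediate labels are assumed.
CATEGORY_ORDER = ["Very Positive", "Positive", "Neutral", "Negative", "Very Negative"]


def sentiment_percentages(labels, order=CATEGORY_ORDER):
    """Return (category, percentage) pairs in a fixed display order.

    Labels outside `order` are ignored when computing the total.
    """
    counts = Counter(labels)
    total = sum(counts[category] for category in order)
    if total == 0:
        return [(category, 0.0) for category in order]
    return [(category, round(100.0 * counts[category] / total, 1)) for category in order]


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    labels = ["Very Positive"] * 5 + ["Positive"] * 2 + ["Neutral"] + ["Negative"] * 2
    for category, percent in sentiment_percentages(labels):
        print(category, percent)
```

Fixing the category order, rather than sorting by size as the web app does, keeps successive charts directly comparable.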
The web app interface is shown in Figure 9. The ggplot2 library was used to draw the bar charts. Note that the graphs below were redrawn from the raw output, such as that shown in Figure 9, so that the sentiment values range from Very Positive to Very Negative and are given as percentages. The web app plots the sentiment values in order of size. Figure 10 shows the current overall sentiment of CoC (as of October 2021) with a filter Note that the graphs below were redrawn from the raw output, such as that shown in Figure 9, so that the sentiment values range from Very Positive to Very Negative and are given as percentages. The web app plots the sentiment values in order of size. Figure 10 shows the current overall sentiment of CoC (as of October 2021) with a filter that includes transport, safety, community, parks, roads, infrastructure, and health. This is achieved by entering the string 'transport safety community parks roads infrastructure' as the search input. This shows that CoC residents are generally optimistic about their city with the majority expressing positive sentiment (very positive 50% and positive 12%), while only about 22% report negative or very negative sentiment. This can be compared with the findings in Figure 5 Figure 11 shows Twitter analysis of sentiment as a percentage for single topics of COVID and jobs, respectively. Figure 11a shows a mixed response to residents' views of the CoC's measures to contain the pandemic, while Figure 11b shows that the majority of CoC residents believe that the council is generally addressing the employment issue adequately (62% positive or very positive compared with only 23% negative or very negative). It should be noted that the Twitter app is still in an experimental configuration and needs further refinement. A further caveat on the use of Twitter is that only a fraction of the population uses it (less than 25% in Australia [36]), and most of those lean strongly to the progressive side of politics [37]. 
This would help explain the low numbers of relevant tweets for specific topics and would also skew the results of any Twitter analysis.

Discussion
Full details of both the methodology applied and the results are provided in the Swinburne University student report [38]. The website could be applied to provide regular monthly 'pulse check' updates to councils by using the Twitter API. The pulse check can serve as a useful social indicator tool to measure the immediate sentiment of residents on specific issues that may relate to health and wellbeing or other areas of interest for the council.
The analysis of the 2016 dataset has led to several actionable insights for the CoC.  Figure 11 shows Twitter analysis of sentiment as a percentage for single topics of COVID and jobs, respectively. Figure 11a shows a mixed response to residents' views of the CoC's measures to contain the pandemic, while Figure 11b shows that the majority of CoC residents believe that the council is generally addressing the employment issue adequately (62% positive or very positive compared with only 23% negative or very negative). It should be noted that the Twitter app is still in an experimental configuration and needs further refinement. while only about 22% report negative or very negative sentiment. This can be compared with the findings in Figure 5 for expectations of Casey in 2041. The greater negative responses in 2021 compared with 2016 may be attributable to the COVID-19 pandemic.  Figure 11 shows Twitter analysis of sentiment as a percentage for single topics of COVID and jobs, respectively. Figure 11a shows a mixed response to residents' views of the CoC's measures to contain the pandemic, while Figure 11b shows that the majority of CoC residents believe that the council is generally addressing the employment issue adequately (62% positive or very positive compared with only 23% negative or very negative). It should be noted that the Twitter app is still in an experimental configuration and needs further refinement.  A further caveat on the use of Twitter is that only a fraction of the population uses it (less than 25% in Australia [36]), and most of those lean strongly to the progressive side of politics [37]. This would help explain the low numbers of relevant tweets for specific topics and would also skew the results of any Twitter analysis.

Discussion
Full details of both the methodology applied and the results are provided in the Swinburne University student report [38]. The website could be applied to provide regular monthly 'pulse check' updates to councils by using the Twitter API. The pulse check can serve as a useful social indicator tool to measure the immediate sentiment of residents on specific issues that may relate to health and wellbeing or other areas of interest for the council.
The analysis of the 2016 dataset has led to several actionable insights for the CoC. The results suggest that the CoC should:
• consider safety, cleanliness, and family friendliness as its top priorities;
• invest further in the environment by providing more parks and green spaces;
• improve transport options for its residents;
• address health and safety issues.
Extensions for the web apps include:
• expanding the Twitter query function, which currently only retrieves tweets from the last 6-9 days, and improving the geolocation of tweets;
• studying how sentiment changes over time;
• improving the accuracy of the sentiment analysis performed;
• building drill-down capabilities into the visualizations to promote better analysis;
• creating additional visualizations using RStudio to derive clearer insights from Twitter;
• analysing the upcoming Casey Next survey data, due to be released at the end of 2021, to compare against the findings based on the 2016 survey data.
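One of the extensions listed above, studying how sentiment changes over time, could be approached by grouping scored tweets by calendar month. The following Python sketch is only an illustration under stated assumptions: the record layout (a date paired with a polarity score), the function name `monthly_sentiment`, and the example values are hypothetical and do not describe the existing app.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_sentiment(records):
    """Group (date, score) pairs by month.

    Returns a dict mapping (year, month) to a tuple of
    (mean score, number of tweets), ordered chronologically.
    """
    buckets = defaultdict(list)
    for day, score in records:
        buckets[(day.year, day.month)].append(score)
    return {
        month: (mean(scores), len(scores))
        for month, scores in sorted(buckets.items())
    }


# Hypothetical scored tweets collected across two months.
example_records = [
    (date(2021, 10, 1), 0.25),
    (date(2021, 9, 3), 0.5),
    (date(2021, 9, 20), -0.5),
]
trend = monthly_sentiment(example_records)
```

Keeping the tweet count next to each monthly mean makes it easier to see when a month's value rests on very few tweets, which matters given the low numbers of relevant tweets noted for specific topics.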
A comparison of sentiment from the survey data with current estimates from the Twitter app showed some correlation. However, the Twitter sentiment estimates should be treated with caution. It is unclear whether sentiment changes are due to analysis errors, insufficient statistics, or external influences such as the COVID-19 pandemic. Further refinement of this tool is required to improve the accuracy and reliability of sentiment values from social media.
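The concern about insufficient statistics can be made concrete by attaching an uncertainty range to a reported percentage. The Python sketch below uses the Wilson score interval for a proportion, which is a standard technique and not one the study itself reports using; the tweet counts in the example are hypothetical, since the number of tweets behind each percentage is not stated here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion.

    successes: number of tweets in the class of interest (e.g. positive)
    n: total number of tweets analysed
    z: normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: the same 60% share observed on few and on many tweets.
few_tweets = wilson_interval(12, 20)
many_tweets = wilson_interval(600, 1000)
```

With the same observed share, the interval from 20 tweets is much wider than the one from 1000 tweets, so an apparent change in sentiment between two small samples may not reflect a real change.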

Conclusions
Big data analytics was applied to several local council datasets in Australia. The CoM Social Indicators survey dataset was analysed using quantitative analysis techniques, while sentiment analysis was performed for the Casey Next dataset from the CoC, a local government area in the state of Victoria, Australia. This comprised the analysis of a social survey conducted in 2016 and also the development of a web app able to evaluate sentiment from current social media feeds. While primarily a student project, this investigation has revealed valuable insights that can be exploited by a local city council. The sentiment analysis web APIs serve as a basis for future opportunities with other local councils that are engaged in surveying their citizens for opinions on critical issues. The techniques described here can readily be applied elsewhere.
The use of social media to determine sentiment is in its early stages but shows promise as a means of quickly assessing community opinion on critical or controversial issues. Future work could comprise refinement of the social media app and comparisons of its predictions with recent social surveys.

Institutional Review Board Statement: Ethical review and approval were not applicable for this study.
Informed Consent Statement: Consent to access Twitter data was provided by registering for a Twitter Developer account and adhering to the developer agreement and policy; https://developer.twitter.com/en/developer-terms/agreement-and-policy (accessed on 10 November 2021).

Data Availability Statement: The datasets analysed were published on the Australian Government Open Data repository: https://data.gov.au (accessed on 10 November 2021).