1. Introduction
“You can’t manage what you don’t measure”—Peter Drucker [
1]
The Global Industrial Revolution has significantly improved the quality of life and the development of the global economy [
2]. But at what cost? The economic and quality of life improvements have been accompanied by the exponential increase in air pollutant emissions, which directly impact human health [
3]. Global air pollution causes an estimated
trillion US dollars loss to the global economy, approximately
of the global Gross Domestic Product (GDP) [
4]. Poor air quality significantly affects human health, and given its insidious effects, there is a pressing need for awareness and precautionary measures [
5].
This situation is particularly severe in developing countries. Due to escalating air pollution, Southeast Asian and African countries face the highest burden, and poor air quality became the second-highest risk factor for death in 2021 [
6]. The issue of poor air quality is clearly visible in India, where some 13 of the top 20 polluted cities are located, per the World Air Quality Report 2024 [
6]. According to the Global Burden of Disease (GBD) Assessment 2017 [
7], India is one of the leading nations where increasing air pollution is a significant health risk factor, causing millions of unnatural deaths and health hazards in urban regions like Delhi, Kanpur, Kolkata, Kochi, and suburban areas like Korba (Chhattisgarh) and Ghaziabad (outside of Delhi).
Air pollution in India’s leading cities is primarily caused by particulate matter (PM), particularly the primary pollutant PM2.5. In New Delhi, it is reported that improving air quality to the WHO standard could add 7.8 years to residents’ life expectancy [
8,
9]. Our decision to focus on PM2.5 was guided by both data availability and public health relevance for later downstream tasks. Among the candidate particle size fractions, ultrafine particles are of emerging interest, but they are not routinely monitored or widely reported in Delhi or across India. The lack of consistent datasets at this scale makes meaningful correlation analysis infeasible within the scope of our study. For PM10, the coarse fraction (2.5–10 µm) is typically less directly linked to severe cardiopulmonary and mortality outcomes than PM2.5. Studies, including the WHO Global Air Quality Guidelines (2021) [5], identify PM2.5 as the dominant pollutant of concern because of its ability to penetrate deeply into the alveolar regions of the lungs and its strong association with health burden indicators. Thus, PM2.5 is selected as the major pollutant for our study.
Monitoring air pollutants is crucial for identifying hotspots and shaping effective policies. The Central Pollution Control Board of India (CPCB) (
https://cpcb.nic.in/, accessed on 19 August 2025) has established Continuous Ambient Air Quality Monitoring Stations (CAAQMS) in various cities, including Delhi, to assess air quality and regulate pollution levels in line with WHO [
10] and NAAQS [
11] standards. However, air quality monitoring (AQM) stations only cover
of the country, according to CPCB’s rule of thumb [
12]. The limited deployment of CAAQM sites makes it difficult to collect air quality data. Additionally, each CAAQM station incurs significant installation and annual maintenance expenditures [
13]. Beyond federal reference monitors, alternative measurement approaches exist. These include low-cost air quality sensing [
14] and satellite-based sensing [
15,
16] along with physical [
17] and hybrid machine learning models [
18]. In data-driven, machine learning-based techniques, novel data signals can yield better pattern recognition [
19] and better interpretability, but gaps remain.
In parallel, the ubiquitous use of online social media (OSM) such as Twitter/X, Reddit, and Sina Weibo has strengthened its role in participatory sensing. While OSM does not replace physical monitors, it can function as a proxy indicator of public perception. Studies comparing multiple social platforms (e.g., Twitter vs. Reddit) have shown that user responses can differ in timing, volume, and content, underscoring the potential value of cross-platform perspectives in sensing public sentiment [
20]. Recent cross-platform studies also show that user perceptions can vary depending on the medium. For example, Ref. [
21] compared public discourse on electric vehicles across Reddit and Twitter, highlighting differences in demographic representation and discussion focus. Urban populations increasingly use platforms like Twitter/X and Sina Weibo to express their opinions on environmental issues. Recent studies have shown that social media data can capture community views on pollutants (especially PM2.5), track spatiotemporal pollution dynamics using machine learning, and reveal how people respond to deteriorating air quality [
22,
23,
24,
25,
26,
27]. India’s growing user base on Twitter/X offers a unique opportunity to analyse perceptions and patterns.
In our prior work [
13,
28], we examined how Twitter/X data can identify pollution signals amid noise, including time drifts between public perception and ground truth, anomalous surges of discussion, and user-network influences such as retweets, followers, and favourites. Baseline PM2.5 measurements were collected from the U.S. Embassy reference-grade monitor at RK Puram, New Delhi (
Figure 1). Building on prior findings [
13,
28], and after integrating this stream with the second dataset, the present study extends the analysis to characterize temporal and user-specific variation across the pre-COVID-19 and COVID-19 periods.
During COVID-19, Delhi experienced up to a
reduction in particulate matter concentrations [
31], alongside shifts in online discussions that focused on “Government Initiatives,” “Pollution Control Behaviours,” and, to a lesser extent, “Awareness Campaigns” [
32]. Under normal conditions, seasonal changes in PM2.5 are mirrored in Twitter/X discussions, with frequent use of terms such as “severe,” “breathe,” “choke,” and “worse” [
25,
33].
Based on this context, our research is driven by two questions:
1. How do temporal and user-specific features of tweets (e.g., frequency, lags, and user characteristics) relate to PM2.5 concentrations?
2. How do the behavioural patterns of these factors differ across seasonal variation and black swan events such as the COVID-19 pandemic?
The term “AirCalypse,” combining “Air” and “Apocalypse,” highlights the urgency of the air pollution crisis and continues our earlier series [
13,
28]. In this study, we empirically explore the time-synchronised association of PM2.5 with Twitter/X metadata in New Delhi over 18 months (February 2019–September 2020), spanning pre-COVID-19 and COVID-19 timelines. Unlike prior work focused primarily on content, this paper emphasises metadata (tweet frequency, lags, recurrence, and user authenticity) as features with potential utility for future data-driven air quality modelling.
2. Literature Survey
The current study examines the temporal relationship between Twitter-specific features, such as daily tweet frequency, tweet lags (1–5), user followers, verified, listed, and user favourites, and PM2.5 levels as part of outdoor air quality monitoring. Given the efficacy of online social media (OSM) in air quality monitoring, popular platforms such as Twitter/X and Sina-Weibo (China’s largest micro-blogging service) serve as the primary data sources for obtaining reports, opinions, sentiments, and so on, disseminated by the community. In previous investigations, researchers used various mining techniques to interpret community perception traits on these OSM platforms with respect to outdoor air quality monitoring.
First, several works have focused on prediction and modelling approaches using meteorological and pollutant data. In [
34,
35], the authors exploit primary and secondary meteorological and pollutant data, suggesting a machine learning-based technique to increase air quality prediction. Similarly, Ref. [
36] applied a Bi-LSTM deep learning model to assess air quality changes before, during, and after COVID-19 lockdowns across multiple cities in Henan, China. Their findings highlight how restrictions reduced
,
,
, and
, illustrating both the predictive capacity of deep learning and the unique natural experiment created by the pandemic. These authors used various statistical analyses, machine learning, and deep learning methods to estimate the concentration of pollutants, focusing on pollutants. After identifying the most impacted pollutants using statistical methods, the machine learning and deep learning models were implemented, and it was discovered that deep learning models are more effective in forecasting.
Next, we showcase some studies that have examined Twitter/X as a tool for air quality monitoring. In [
37], the authors investigate how
air pollution hazards are defined and appraised in a networked public sphere, utilising Twitter data, government documents, and media stories. The study blends Beck’s idea of risk society with digital media theories, and it emphasises a transnational but linguistically split public sphere and the influence of media. The article also presents statistical analyses of the research questions connected to these aspects and maps their impacts. In contrast, Ref. [25] investigated the use of Twitter data for qualitative air pollution monitoring in Delhi between 2019 and 2020. Tweets were classified as indicating poor, good, or neutral air quality using a machine learning model that included embedding and BiLSTM layers. PM2.5 concentration values were determined by analysing tweets and official CAAQMS data. The approach demonstrated remarkable accuracy (80–
) under harsh air quality conditions. Its success depends on public awareness, Twitter engagement, and visible air quality improvements. The authors in [
26] investigate predicting urban air quality using Twitter data in cities without monitoring stations. A framework for gathering and geo-tagging relevant tweets is created, and transfer learning is utilised to apply ideas from monitored to unmonitored cities. Tests in UK and US cities reveal that Twitter-based estimations are accurate, although not as precise as spatial interpolation. However, combining the two systems enhances accuracy, particularly in remote towns. The study emphasises the utility of social media for air quality monitoring. Gradient tree boosting, a regression-based method, was applied here. The work in [
27] proposes a framework to model and analyse how air quality messages spread via Twitter/X. It investigates both the flow of messages and the content supplied by users. The method employs natural language processing (NLP) tools and deep learning classification algorithms to categorise tweets from scratch. It uses both quantitative and qualitative methodologies within an interdisciplinary framework. The methodology is demonstrated through a specific air quality use case. Finally, the work in [
33] analyses nearly two years of Twitter data (September 2015–May 2018) from Paris, London, and New Delhi to assess public responses to air quality issues. It was discovered that health concerns outweighed reactions to deteriorating air quality, particularly in New Delhi. The study discovers hashtags that best correlate with local pollution levels and demonstrates consistent public behaviour patterns across cities. Topic modelling identifies major themes such as health, policy, and event-specific pollution spikes. The study shows that Twitter can be useful for large-scale, real-time public opinion research on environmental health. Text classification has been carried out using machine learning methods.
Beyond the single-platform analyses, researchers have also examined cross-platform differences in perception. For example, Ref. [
20] looks into the 2019 Ridgecrest earthquake across Twitter and Reddit, showing how public response varies between platforms. Similarly, Ref. [
21] compared discussions on electric vehicles across Reddit and Twitter, finding distinct patterns of demographic representation and discourse focus. These studies highlight the importance of understanding cross-platform variation when analysing public perceptions.
Besides Twitter, several studies have explored community perceptions posted on Sina-Weibo in the context of air quality monitoring. The authors in [
22] investigate how to track air quality trends and public perception. Researchers assessed 93 million posts using keyword filtering and topic models to discover pollution-related information. Message volumes were compared to official pollution data from 74 cities to evaluate reliability. A qualitative analysis of sample posts indicated frequent discussions of health issues and behavioural responses. The findings emphasise Sina Weibo’s potential as a valuable real-time environmental health monitoring source in China. Basic statistical tools such as Pearson correlation and qualitative data were used in this study. Similarly, Ref. [
23] uses geo-targeted Sina Weibo posts to track air quality trends in major Chinese cities. A social media analytics framework was created to investigate the relationship between Weibo postings and official Air Quality Index (AQI) data. Messages were divided into three categories: retweets, app-generated, and original individual posts. The original individual messages had the strongest association with AQI changes. The findings indicate that filtered social media data can track air quality changes over time. Gradient Tree Boosting (GTB) was used to solve the classification problems. In contrast, Ref. [
24] suggests that social media analysis can be a cost-effective alternative to traditional environmental monitoring in China. An Environmental Quality Index (EQI) was created to gauge public opinion about air, water, and food quality. Text data from Sina Weibo and Baidu Tieba (2015–2016) were examined using a support vector machine (SVM), obtaining
classification accuracy. The EQI scores were determined for 27 provinces. Results were consistent with official data, demonstrating the model’s viability and effectiveness.
Beyond outdoor air quality, researchers have also investigated indoor environments through social media. Ref. [
38] analysed indoor air quality using social media and NLP methods from the perception of United States-based occupants, highlighting the role of OSM in understanding indoor environmental health concerns. More broadly, OSM-based analysis has extended to environmental issues beyond air quality. Ref. [
39] conducted a sentiment and emotion analysis of environmental posts, which provides insights into how communities express concerns about ecological issues online. These studies show that OSM-based environmental monitoring is becoming more widespread indoors and outdoors.
Finally, beyond social media–driven studies, several investigations have specifically assessed the impact of COVID-19 lockdowns on urban air quality. In [
40], the authors examined the Madrid region, finding significant
and
reductions during mobility restrictions. Ref. [
41] analysed pollutant patterns in Lahore, Pakistan, reporting sharp declines during lockdown followed by post-lockdown surges, with strong correlations between PM and
. Similarly, Ref. [
42] studied Shanghai, observing reductions of 61% in
and 43% in
, underscoring the combined role of emission reductions and meteorological influences. Together, these studies reinforce that COVID-19 restrictions offered valuable insight into the anthropogenic drivers of urban air quality.
Novelty of Present Study: It is evident from past studies that the evolution of AQI prediction has already been initiated through several investigations. So far, the community has tried to analyse the significance of meteorological and seasonal factors over pollutants for predictive modelling using classical machine learning, deep learning, and attention models [
34,
35]. In addition, Sina-Weibo textual messages associated with air quality have been used to (a) compare social media data with AQI, (b) quantify public perceptions of pollution, (c) monitor the spatio-temporal dynamics of AQI through machine learning, and (d) analyse and classify community responses to air quality degradation [
22,
23,
24]. Furthermore, it has been perceived from studies [
20,
21,
25,
26,
27,
37] that contextual or cross-platform analysis of community response tweets helps policy makers and researchers map social perception to AQI in real time. Such associations were captured by analysing (a) the temporal correlation of tweets carrying trending hashtags, (b) content classification through machine learning, and (c) trending pollution-related topics identified by unsupervised models over a timeline. However, how platform-specific metadata and their derived features vary with changing pollution levels remains unexplored. The significance of criticality over relevance in data stream volume, user handles, and other metadata for air quality should also be explored. Moreover, the contrasts in the behavioural patterns of such factors with shifting air quality across the pre-COVID-19 and COVID-19 timelines (pre-COVID-19: 20 March 2019 to 19 March 2020; COVID-19: 20 March 2020 to 20 September 2020), and the rationale behind such patterns, should be examined. The current study makes such an attempt by exploring the temporal and user-defined properties of tweet objects, namely daily magnitude, lags 1–5, user followers, user verified, user listed, and user favourite, to analyse their association with air quality (particularly with variations in PM2.5 concentration). The significance of the features derived from these temporal and user-specific properties, i.e., intensity of community engagement, community intents, user authenticity, and tweet recurrence rate, is then studied against changing PM2.5 concentration over the pre-COVID-19 and COVID-19 timelines. Finally, the behavioural patterns of key features are evaluated on both timelines, highlighting their efficacy in detecting PM2.5 concentration.
3. Material & Methods
Considering severe air pollution in Delhi, hashtags such as #airpollutiondelhi, #delhismog, #delhiairpollution, and #delhipollution gained significant traction on Twitter as air quality levels worsened dramatically. The increase in pollution adversely impacts residents of both urban and suburban regions, resulting in a substantial surge of tweets that rapidly turn these hashtags into trending topics. The initial phase of our analytical framework involved collecting tweets related to air pollution from Twitter. For this purpose, we used Twitter’s Streaming API with the Researcher Access API (
https://docs.tweepy.org/en/stable/api.html, accessed on 20 August 2025). Tweets were streamed continuously through a local server running 24 × 7 from 20 March 2019 to 20 September 2020, retrieving more than
million tweets. Network and power interruptions caused gaps in data collection, so some days of data are missing within this period. Every tweet gathered via the API contains several essential attributes, including a unique 64-bit integer tweet-id, the creation timestamp (created_at), the user_id of the tweet author, the tweet text, and many more. Given our focus on monitoring Delhi’s air pollution, we filtered the dataset to include only English-language tweets explicitly related to Delhi’s air pollution, which ensures reliable preprocessing and NLP tool support. To do so, we used the filtering features of the X/Twitter streaming API, which allow filtering by language, location, and specific keywords. Additional filtering was performed using combinations of targeted hashtags such as #NewDelhiairpollution, #delhipollution, #delhismog, #delhichokes and #savedelhi. After applying these filters, we analysed a dataset of 1.1 million tweets, focusing on tweet content and user profiles. When cleaning tweet content for user intent analysis, we removed undesired elements, i.e., stop words, hashtags, links and URLs, emojis, @-mentions, other exclamations, and non-ASCII characters, since they are not required for intent analysis. We also collected air quality data from the US Embassy’s monitoring station in Delhi (
https://in.usembassy.gov/air-quality-data-information-4/, accessed on 20 August 2025). The US Embassy’s data provided detailed monitoring of principal pollutants, including PM2.5, with rigorous data validation, recorded at 60-min intervals. These ground truth data were analysed alongside the tweets for the time frame from March 2019 to September 2020, enabling a comprehensive assessment of pollution patterns, particularly concerning PM2.5 levels in Delhi. The data collection process is depicted in
Figure 2.
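The content clean-up described above can be illustrated with a minimal Python sketch. It is not the authors' actual pipeline: the regular expressions, the tiny stop-word list, and the function name `clean_tweet` are assumptions made for illustration only.

```python
import re

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and", "on", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")    # links and URLs
TAG_RE = re.compile(r"[@#]\w+")                  # @-mentions and hashtags
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")      # emojis and other non-ASCII characters
PUNCT_RE = re.compile(r"[^a-z0-9\s]")            # exclamation marks and other punctuation


def clean_tweet(text: str) -> str:
    """Strip elements that are not needed for intent analysis."""
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text.lower())
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)
```

URLs are removed before mentions and hashtags so that a `#fragment` inside a link is not treated as a hashtag.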
Feature Analysis in Pre-COVID-19 & COVID-19 Scenario
To analyse the impact of temporal and user-specific features on air quality in the pre-COVID-19 and COVID-19 scenarios, tweets related to air pollution in Delhi are considered over two timelines: 20 March 2019 to 19 March 2020 and 20 March 2020 to 20 September 2020, respectively. Here, the periodic distribution of Twitter-specific features (tweet frequency and tweet lags 1–5) and user-specific features (followers, verified, listed, and favourite counts) has been assessed. Tweet frequency is defined as the number of tweets posted on a particular topic in a given interval. It plays an important role in OSM as it affects topic visibility and user engagement. The impact of these features is measured against the raw PM2.5 concentration over the two timelines.
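As a rough illustration of the two timelines and of daily tweet frequency, the following Python sketch labels each day with its period and counts tweets per calendar day. The function names and the assumption that timestamps arrive as ISO 8601 strings are hypothetical; only the date boundaries come from the text.

```python
from collections import Counter
from datetime import date, datetime

# Period boundaries as stated in the text.
PRE_COVID_START = date(2019, 3, 20)
COVID_START = date(2020, 3, 20)
COVID_END = date(2020, 9, 20)


def period_of(day):
    """Return the study period a day falls in, or None if outside both."""
    if PRE_COVID_START <= day < COVID_START:
        return "pre-COVID-19"
    if COVID_START <= day <= COVID_END:
        return "COVID-19"
    return None


def daily_frequency(created_at_values):
    """Count tweets per calendar day (assumes ISO 8601 timestamp strings)."""
    counts = Counter()
    for stamp in created_at_values:
        counts[datetime.fromisoformat(stamp).date()] += 1
    return dict(counts)
```

Days lost to network or power interruptions simply do not appear in the result, so they can be told apart from days with zero tweets only if the outage dates are tracked separately.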
Temporal Features: Considerable recent research shows a correlation between PM2.5 concentration levels and pollution-related social media posts (X/Twitter and Sina-Weibo) at different geographic granularities. For instance, the authors in [33] have established significant associations for pollution-related posts from London, Delhi, Beijing, and many more. Such insights demonstrate that public concern rises with PM2.5 concentration, which allows it to serve as a proxy for air quality monitoring. Besides, through [13], it is evident that, although social perception rises with PM2.5, there is a time drift between social perception on Twitter and the actual ground truth (raw PM2.5 concentration). The reason is that social perception takes longer to form than a chemical sensor takes to register a reading. Accordingly, daily tweet frequency lags (lags 1–5) are used as features to account for the delay between social perception and sensor ground truth data. The temporal lag features are generated from the aggregated daily tweet counts to explore the relationship between X/Twitter community perception and measured air quality. A lag feature represents a shifted value of the original time series, such that information from prior days is used to explain present-day variation. In this study, lagged variables were created for one to five days preceding the current observation, i.e., Lag 1 corresponds to the number of tweets posted one day prior, Lag 2 to two days prior, and so on up to Lag 5. The rationale for including lagged features is twofold. First, human behavioural responses to changes in air quality are not instantaneous. For example, exposure to elevated PM2.5 levels may increase online discourse only after symptoms are felt or after media coverage disseminates the event. Second, from a modelling perspective, lagged features minimise the risk of temporal leakage by ensuring that only past user activity is used to interpret or forecast present air quality levels. Prior studies on temporal dynamics of social media have also indicated that event-related discussions often peak with a delay due to the diffusion of information across online networks. By incorporating daily tweet lags, we aim to capture these delayed behavioural patterns and assess their association with ground-truth pollution measurements at the RK Puram station in Delhi.
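The lag construction can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the study's implementation: it assumes both series are ordered, contiguous daily lists (missing days marked as `None`), and it uses Pearson correlation as one possible association measure. The function names are hypothetical.

```python
from math import sqrt


def lag_series(daily_counts, k):
    """Shift a daily series by k days: position i holds the count from day i - k."""
    n = len(daily_counts)
    return ([None] * k + list(daily_counts))[:n]


def pearson(xs, ys):
    """Pearson correlation over the positions where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lag_correlations(daily_counts, daily_pm25, max_lag=5):
    """Correlate Lag 1..max_lag tweet counts with same-day PM2.5 values."""
    return {
        k: pearson(lag_series(daily_counts, k), daily_pm25)
        for k in range(1, max_lag + 1)
    }
```

Because each lagged value comes strictly from earlier days, the first k positions of a Lag k series are empty and are skipped when the correlation is computed.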
User Features: Recently, several studies have examined how Twitter user profile variables, including follower count, verification status, listing count, and favourite count, can predict user influence in debates about air pollution and PM2.5 monitoring. The authors in [28] investigated user-specific attributes to identify important users and forecast retweets, implying that these measures are strong markers of user influence and engagement. Such influence and community engagement in pollution monitoring have been analysed on a daily basis against the sensor ground truth, i.e., PM2.5 concentration.
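One way to bring user-specific attributes to the same daily resolution as the sensor data is sketched below. The text does not state how the attributes were aggregated, so the use of daily means and of a verified-user share is an assumption. The `day` key and the function name are hypothetical, while the `user` fields follow the names in the Twitter v1.1 user object.

```python
from collections import defaultdict
from statistics import mean


def daily_user_features(tweets):
    """Aggregate per-tweet user attributes into one feature row per day.

    Each tweet is a dict with a "day" key and a "user" dict (assumed layout).
    """
    users_by_day = defaultdict(list)
    for tweet in tweets:
        users_by_day[tweet["day"]].append(tweet["user"])

    features = {}
    for day, users in users_by_day.items():
        features[day] = {
            "tweet_count": len(users),
            "followers_mean": mean(u["followers_count"] for u in users),
            "verified_share": mean(1 if u["verified"] else 0 for u in users),
            "listed_mean": mean(u["listed_count"] for u in users),
            "favourites_mean": mean(u["favourites_count"] for u in users),
        }
    return features
```

Follower counts are typically heavy-tailed, so a median or a log transform may be a more robust choice than the mean used here.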
6. Conclusions & Future Research
This study empirically demonstrates that predictive modelling for air quality monitoring can be significantly enhanced by using features beyond the conventional seasonal, meteorological, and content-based ones. While prior research has made extensive use of community sentiments, trending hashtags, and qualitative intent analysis from social media data, our findings highlight the importance of systematically incorporating platform-specific metadata such as temporal lags, user authenticity, engagement patterns, and recurrence rates into modelling frameworks.
The results show a clear empirical relationship between PM2.5 concentrations and community perceptions when assessed through temporal and user-specific attributes. Moreover, the current analysis shows that the features exhibit strong dependencies when observed alongside air quality fluctuations across both seasonal and COVID-19 transitions. This underscores their utility as complementary signals for ground-truth measurements. In this study, our analysis is limited to New Delhi, but there are similar studies in other contexts (e.g., Paris and London [
33], multiple Chinese cities [
22,
23,
42], and South Asian urban centers such as Lahore [
41]) which demonstrate that linking social media signals with air quality monitoring has broader relevance. These findings suggest that our framework may generalise beyond Delhi. As a next step, this work should be assessed across multiple cities using additional monitoring stations and multilingual analyses, enabling broader insights into public perception and region-specific participatory monitoring strategies.
Thus, the results from this research can be expanded in the following ways:
Integration of multimodal data sources: We plan to combine social media metadata with content-based features, low-cost sensor data, and satellite observations to construct more robust forecasting models.
Model explainability: Quantifying the relative contribution of each feature (temporal, user-specific, content-based, and multimodal) can improve transparency, interpretability, and policy relevance, which offers another avenue for extending this work.
Regional Variability: Expand the study in a multi-city context by incorporating data from additional monitoring stations and expanding to multilingual analyses.
Sustainability-driven Policy Making: When we cannot measure, we cannot make decisions that can help in sustainable development. Thus, deployment of models that are enriched with explainable features can guide urban planning, adaptive pollution control, and participatory monitoring in contexts where traditional infrastructure is limited. Hence, expansion of this avenue needs further attention.
Thus, through these research directions, we propose to advance predictive accuracy while ensuring that air quality monitoring frameworks remain robust, transparent, equitable, and actionable, even though most of the globe lacks continuous federal reference-grade monitors. Without proper monitoring of air quality, there will be a dearth of informed policy making. Hence, multimodal data-based air quality models can provide a path towards data-driven decision making and towards attaining the Sustainable Development Goals.