Peer-Review Record

Key Concerns and Drivers of Low-Cost Air Quality Sensor Use

Sustainability 2022, 14(1), 584; https://doi.org/10.3390/su14010584
by Priyanka Nadia deSouza
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 17 November 2021 / Revised: 3 January 2022 / Accepted: 3 January 2022 / Published: 5 January 2022
(This article belongs to the Topic Climate Change and Environmental Sustainability)

Round 1

Reviewer 1 Report

Key concerns and drivers of low-cost air quality sensor use

The author analyzed the Amazon user reviews (~1900) of low-cost air quality sensors (LCS) with an aim of aiding the development of ‘fitness of purpose’ appraisals for end-use applications. Freely-available consumer-provided reviews on LCS were accessed to analyze qualitative data using unsupervised machine learning approaches.

The author has provided an excellent description of data access, pre-processing and approaches to qualitative data analysis. The 9-topic model is well justified and covers the most important attributes to assess ‘fitness of purpose’. The word cloud visualizations provide an excellent summary of the most common themes discussed by the consumers and are very relevant to the objective of the paper.

Education and curiosity to learn more about air quality was found to be the most prevalent theme discussed by the reviewers. The author identified relevant topics commonly discussed by the customers that could help policy-makers develop interventions to best build and grow communities of sensors. Given consumers' growing interest in and understanding of LCS, it would be useful to readers if the author could discuss the importance of, or the need for, sensor manufacturers including ‘performance evaluation certificates and recommendations’. This ties in nicely with the highly prevalent theme identified: education and curiosity.

Limitations of the study are clearly identified and well stated.

Author Response

Thank you for these comments. In accordance with the reviews I have further justified the number of topics chosen using semantic coherence and cross-validation plots. In order to be consistent with these plots, I have rerun my model and chosen the number of topics to be 7 instead of 9. This does not change my overall results. 

 

I have tried to dig deeper into the discussion of the performance evaluation of low-cost sensors by adding the following text:

“The identification of these uses can spur new ‘fitness of purpose’ tests of low-cost sensors for these purposes. For example, it is becoming increasingly evident that it is important to evaluate how well such devices detect wildfire smoke, as many users make important decisions to protect their health during such events, such as deciding when to open their windows. Existing protocols to evaluate sensor performance, such as those used by the South Coast Air Quality Management District (http://www.aqmd.gov/aq-spec, last accessed December 23, 2021), typically do not include testing these devices during smoky conditions. However, important research is underway to evaluate different devices during wildfires, and to develop adjustment factors for measurements from different low-cost sensors to improve their accuracy in capturing wildfire smoke [22,23].”

Author Response File: Author Response.docx

Reviewer 2 Report

My general comment is that the manuscript is quite interesting: the author discusses user reviews of low-cost sensors sold on Amazon. However, many aspects should be clearer and the grammar needs to be improved. In addition, please check the journal's format again, especially the paragraph indentation. My specific comments are as follows:

  1. In Section 2, on the method, the author mentions unsupervised learning, but the introduction does not mention it at all. The author should mention it there so that the introduction and the method are connected.
  2. I think the author should summarize the data used in this manuscript in a table, so that the data are easier for the reader to understand.
  3. The pre-processing is difficult to understand. Why is it necessary to remove some elements of the text?
  4. Regarding the pre-processing method, I suggest the author use a block diagram to make it easier for the reader to understand.
  5. Similar to point 4, please provide a block diagram summarizing the steps in the unsupervised learning process. It will make it easier for the reader to understand what the author does with unsupervised learning.
  6. Please explain in more detail the algorithm used in the unsupervised learning.
  7. Regarding the topics, how can the author determine that 9 topics is better than 5, 6, 8, 9, 10, 15, or 20? Is there any statistical proof for this part?
  8. How did the author divide the dataset for the training process?
  9. What does the value in each topic mean? (Table 1)
  10. What is the meaning of the word clouds? Do they refer to the result of the model?
  11. The figure in Table 1 is not clear and is very small. Please improve the resolution of the figure.
  12. Please do not use the pronoun "I" in the text.
  13. Is there any explanation of the Kruskal-Wallis test in this manuscript?
  14. Where is the conclusion?
  15. Please improve the quality of the figure and make it clear.

Author Response

Reviewer 2

 

My general comment is that the manuscript is quite interesting: the author discusses user reviews of low-cost sensors sold on Amazon. However, many aspects should be clearer and the grammar needs to be improved. In addition, please check the journal's format again, especially the paragraph indentation. My specific comments are as follows:

Thank you for this comment. I have improved the grammar of my article.

  1. In Section 2, on the method, the author mentions unsupervised learning, but the introduction does not mention it at all. The author should mention it there so that the introduction and the method are connected.

Thanks for this important comment. I have included the following text in the Introduction:

“Infodemiology, the practice of analyzing consumer-sourced qualitative data, such as Amazon reviews, is an emerging area of research that uses freely-available consumer-generated data to provide insights that can be used for making public health-related decisions [12]. As little data currently exists on the usage of low-cost sensors by the public, Amazon text reviews and ratings (1-5 stars) of products can provide valuable insights to policy-makers as a key first step. This paper uses topic modeling, an unsupervised machine learning technique, to extract key topics from the 1000’s of reviews downloaded. Essentially, topic models pick up co-occurrence signals between different words in the collection of text. The underlying assumption is that words that occur often in the same sentence are likely to belong to the same latent topic.”
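The co-occurrence signal mentioned in the quoted text can be illustrated with a minimal Python sketch. It is an assumption-laden illustration only: the study itself worked in R, the toy reviews are invented, and co-occurrence is counted here per review rather than per sentence.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count how many documents contain each unordered pair of distinct words."""
    counts = Counter()
    for doc in documents:
        # Each word is counted once per document; pairs are stored alphabetically.
        vocab = sorted(set(doc.lower().split()))
        for pair in combinations(vocab, 2):
            counts[pair] += 1
    return counts


# Hypothetical toy reviews, not drawn from the study's corpus.
reviews = [
    "wildfire smoke outside windows closed",
    "smoke from wildfire made numbers spike",
    "easy setup and clear display",
]
counts = cooccurrence_counts(reviews)
print(counts[("smoke", "wildfire")])
print(counts[("setup", "smoke")])
```

Pairs such as "smoke" and "wildfire" that repeatedly appear together are the kind of signal a topic model can group into one latent topic, while pairs that never appear together contribute no such evidence.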

 

  2. I think the author should summarize the data used in this manuscript in a table, so that the data are easier for the reader to understand.

Thank you. I have made this change.

  3. The pre-processing is difficult to understand. Why is it necessary to remove some elements of the text?

Thank you. I have added the following explanation for each pre-processing step for clarity:

“The corpus was pre-processed in the following manner:

 

(i) All reviews with a word count of less than 25 words were removed. This is because reviews with fewer than 25 words contained little information that would be relevant to a topic model. Three reviews that were not in English were removed. 1,888 reviews from 94 different monitors remained.

(ii) Special characters (non-ASCII characters), punctuation and numbers were also removed from the review text.

(iii) Stop words such as ‘the’ and ‘of’ from the SMART stop-word list [13], which is built into the tidytext package in R [14], were removed.

(v) After this step, additional stop words from a custom list were removed: { "air", "read", "devic", "unit", "meter", "qualiti", "product", "quality", "device", "monitor", "amazon", "measure", "hcho", "pm", "pm25", "reading", "bought", "readings", "values", "sensor", "data", "change", "compare", "compared", "found", "level", "measure", "results", "levels", "pollution", "check", "highly", "found", "house", "home", "googl", "browser", "html5", "video", "formaldehyd", "formaldehyde", "ppm", "ghz", "pro", "tvoc", "voc", "temperatur", "co2", "hour", "set", "minut", "time", "la", "de", "lot", "el", "la", "particle", "particulate" }

Stop words tend to be high-probability terms that can skew the word type probability distribution and slow inference from topic models.

(iv) Stemming was used to reduce words and their derivatives (e.g., determine, determined, determining) to a common stem, thus rendering all derivative forms of a single word in an unambiguous non-inflected state. As language exhibits a rich inflectional morphology, if derivatives of the same word are treated as unique tokens, the co-occurrence signal between different words in the corpus under consideration will become weaker. We thus apply stemming to improve the quality of the topic model.”
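The pre-processing steps quoted above could be sketched roughly as follows. This is a minimal Python illustration, not the author's R pipeline: the stop-word sets are tiny placeholders for the SMART and custom lists, `naive_stem` is a hypothetical stand-in for a real stemmer, and the order in which stemming and custom stop-word removal are applied is an assumption.

```python
import re

# Tiny illustrative stop-word sets; the study used the SMART list plus a custom list.
STOP_WORDS = {"the", "of", "a", "and", "to", "is", "it", "in", "i", "my", "this"}
CUSTOM_STOP_WORDS = {"air", "monitor", "sensor", "quality"}
ALL_STOP_WORDS = STOP_WORDS | CUSTOM_STOP_WORDS
MIN_WORDS = 25  # reviews shorter than this were dropped in the study


def naive_stem(word):
    """Crude suffix stripper, a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(reviews, min_words=MIN_WORDS):
    """Return one list of stemmed tokens per review that passes the length filter."""
    processed = []
    for text in reviews:
        # Step (i): drop short reviews.
        if len(text.split()) < min_words:
            continue
        # Step (ii): drop non-ASCII characters, then punctuation and digits.
        ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
        letters_only = re.sub(r"[^A-Za-z\s]", " ", ascii_text).lower()
        # Stop-word removal (general and custom lists combined here).
        tokens = [t for t in letters_only.split() if t not in ALL_STOP_WORDS]
        # Stemming, so that inflected forms share one token.
        processed.append([naive_stem(t) for t in tokens])
    return processed


# Hypothetical example with a lowered length threshold so the toy text survives.
example = preprocess(
    ["The sensor is determining smoke levels!! 123", "too short"], min_words=3
)
print(example)
```

In practice a real stemmer would replace `naive_stem`, and the language filter for non-English reviews, which is omitted here, would be applied before tokenization.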

 

  4. Regarding the pre-processing method, I suggest the author use a block diagram to make it easier for the reader to understand.
  5. Similar to point 4, please provide a block diagram summarizing the steps in the unsupervised learning process. It will make it easier for the reader to understand what the author does with unsupervised learning.

Thank you for comments 4) and 5). After preprocessing (removing stop words, stemming words), running the topic model using the stm package was a single step. Given the relative simplicity of this process I have opted to use bullets to explain this process. This allows me to explain why each step was conducted. Such an approach has been used in other papers referred to in the text such as: Hsu A, Rauber R. Diverse climate actors show limited coordination in a large-scale text analysis of strategy documents. Commun Earth Environ. 2021 Feb 9;2(1):1–12.

  6. Please explain in more detail the algorithm used in the unsupervised learning.

Thank you. I have attempted to add more of an explanation:

“Unsupervised machine learning has been commonly used in studies that do not have any predetermined framework to analyze unstructured text data [13–15]. Such topic modeling enables researchers to discover topics from the data. This process may avoid biases generated through non-automated coding that relies on subjective interpretations of the text.

 

Essentially, a topic model represents the overall themes or topics in a corpus as probability distributions over words in a vocabulary; so while the probability of the word “smoke” is high in a topic relating to wildfires, it will likely be relatively low in a topic pertaining to ease of use of a low-cost sensor. Documents (which in this case are reviews) consisting of combinations of words were modeled using a generative process where first a topic is selected according to some probability distribution specific to a given document, and then a word is selected from that topic in accordance with the topic’s distribution over vocabulary words. Using the documents in this corpus (which are the output of such a generative process) we can infer the likelihood of each topic given a document, and each word given a topic through a training process.

 

Latent Dirichlet allocation (LDA) is the most commonly used topic model. However, LDA has several limitations; one of the most important is that it assumes that topics are independent of each other. The recently introduced structural topic modeling (STM) algorithm builds on the LDA model, and allows topics to be correlated [16]. For a more detailed discussion of STM models refer to [17]. In this paper, topic modeling was implemented in R with the stm package [18] using the spectral algorithm without the inclusion of covariates [19]. When the number of documents is large, as in this case, the spectral algorithm has been shown to perform well and provide more stable and consistent results than the LDA model [18].”
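The generative process described in the quoted explanation (pick a topic from the document's topic distribution, then pick a word from that topic's word distribution) can be sketched in a few lines of Python. The topics, words and probabilities below are hypothetical, and the sketch covers only the forward sampling step, not the inference performed by the stm package.

```python
import random


def generate_document(doc_topic_probs, topic_word_probs, n_words, rng):
    """Draw a topic for each word position, then a word from that topic."""
    topics = list(doc_topic_probs)
    topic_weights = [doc_topic_probs[t] for t in topics]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=topic_weights)[0]
        vocab = list(topic_word_probs[topic])
        word_weights = [topic_word_probs[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words


# Hypothetical topics: "smoke" is likely under the wildfire topic and
# unlikely under the ease-of-use topic, mirroring the example in the text.
topic_word_probs = {
    "wildfire": {"smoke": 0.6, "window": 0.3, "setup": 0.1},
    "ease_of_use": {"smoke": 0.05, "window": 0.05, "setup": 0.9},
}
doc_topic_probs = {"wildfire": 0.8, "ease_of_use": 0.2}

rng = random.Random(0)  # seeded so repeated runs give the same draw
document = generate_document(doc_topic_probs, topic_word_probs, 10, rng)
print(document)
```

Training a topic model runs this logic in reverse: given only the observed words, it estimates the topic and word distributions that most plausibly generated them.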

 

  7. Regarding the topics, how can the author determine the number of topics? Is there any statistical proof for this part?

Thank you. I have added the following text and figure. I have changed the number of topics from 9 to 7 to be consistent with this step. The overall results do not change.

To determine the number of topics in the text, this paper used the metrics of held-out likelihood and semantic coherence provided by the STM package. Held-out likelihood (an indication of cross-validation) is calculated by holding out 10% of words in the corpus, training the model and using document-level latent variables to evaluate the probability of the held-out words. Semantic coherence (an indication of higher topic interpretability) measures the frequency of the co-occurrence of top words of a topic [20]. This paper compared the performance of 5, 6, 7, 8, 9, 10, 15, and 20 topic models using these metrics.

 

It can be observed that seven topics appear to provide the best trade-off between the greatest coherence and held-out likelihood (Figure 1). A model with seven topics also withstood the author’s subjective evaluation of the themes produced.

Figure 1: Diagnostics to determine the number of topics
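To make the semantic coherence diagnostic more concrete, the sketch below computes a commonly used co-occurrence-based coherence score in Python. The formula is an assumption: the response only says that coherence measures how often a topic's top words co-occur, so this version (the sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words, where D counts documents) may differ in detail from what the stm package reports. The toy documents are hypothetical.

```python
import math


def semantic_coherence(top_words, documents):
    """Sum of log((co-document count + 1) / document count) over top-word pairs.

    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score


# Hypothetical tokenized reviews.
docs = [
    ["smoke", "wildfire"],
    ["smoke", "wildfire"],
    ["setup", "display"],
]
coherent = semantic_coherence(["smoke", "wildfire"], docs)
incoherent = semantic_coherence(["smoke", "setup"], docs)
print(coherent, incoherent)
```

A topic whose top words tend to appear in the same reviews scores higher than one whose top words rarely meet, which is why the metric is read as a proxy for interpretability when comparing models with different numbers of topics.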

 

  8. How did the author divide the dataset for the training process?

The package holds out 10% of the words for testing. This is part of the cross-validation process described in the response to comment 7).

  9. What does the value in each topic mean? (Table 1)

I have updated Table 1 to make it clearer.

  10. What is the meaning of the word clouds? Do they refer to the result of the model?

I have updated Table 1 and added a description. The word clouds represent high-probability words associated with each topic. Specifically, Figure 2, which now contains the word clouds, is described in the text in the following manner:

“The word clouds for each topic are created and presented in Figure 2, which shows the words with the highest probability of occurrence in a topic. The larger the font, the higher the probability of the occurrence of the word.”

  11. The figure in Table 1 is not clear and is very small. Please improve the resolution of the figure.

Thank you. I have recreated the figures at high resolution and have additionally uploaded high-resolution graphics.

  12. Please do not use the pronoun "I" in the text.

Thank you. I have rewritten the paper to remove the instances of ‘I’.

  13. Is there any explanation of the Kruskal-Wallis test in this manuscript?

I have added this additional explanation:

“Non-parametric Kruskal-Wallis tests, also termed one-way ANOVA on ranks, were run to evaluate whether the prevalence of each topic varied significantly by the customer’s rating, the device being reviewed, and the pollutants reported. Non-parametric tests were used because Shapiro tests revealed that the assumption of normality of the distribution of prevalence of each topic did not hold. Statistical significance in this study is assessed using a p < 0.05 threshold.”
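For readers unfamiliar with the test, the Kruskal-Wallis H statistic can be computed by hand as in the Python sketch below. It is an illustration under stated assumptions: the prevalence values and the three rating groups are invented, the study ran its tests in R, and the closed-form p-value used here holds only for exactly three groups (a chi-square approximation with two degrees of freedom).

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H with average ranks for ties and the usual tie correction.

    Assumes the pooled values are not all identical.
    """
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    ranks = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the mean of ranks i+1 .. j.
        ranks[pooled[i]] = (i + 1 + j) / 2
        t = j - i
        tie_term += t**3 - t
        i = j
    rank_sum_term = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    return h / (1 - tie_term / (n**3 - n))


def p_value_three_groups(h):
    """Chi-square upper tail with 2 degrees of freedom, valid for three groups only."""
    return math.exp(-h / 2)


# Hypothetical topic prevalence values grouped by low, middle and high star ratings.
prevalence_by_rating = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_wallis_h(prevalence_by_rating)
p = p_value_three_groups(h)
print(h, p, p < 0.05)
```

Because the test works on ranks rather than raw values, it does not require the prevalence values to be normally distributed, which is the reason given in the response for choosing it.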

  14. Where is the conclusion?

The conclusions section is optional for this journal and only required when the discussion is complex. The discussion section of this article contains the key insights derived from the text analysis, namely that 1) new protocols are needed to evaluate the performance of low-cost sensors under wildfire conditions, and 2) users evaluate their devices in different ways. There is a great need for the performance evaluation of low-cost sensors by regulators to be more widely disseminated to the public.

  15. Please improve the quality of the figure and make it clear.

Thank you. I have recreated the figures at high resolution and have additionally uploaded high-resolution graphics.

 

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

Thanks to the author for improving the quality and clarity of this manuscript. However, please add the block diagrams that were requested in points 4 and 5.

Author Response

Thank you. I have added a flow chart summarizing the preprocessing steps as described.
