Article
Peer-Review Record

Improving Time Series Data Quality: Identifying Outliers and Handling Missing Values in a Multilocation Gas and Weather Dataset

Smart Cities 2025, 8(3), 82; https://doi.org/10.3390/smartcities8030082
by Ali Suliman AlSalehy 1,2 and Mike Bailey 1,*
Reviewer 2: Anonymous
Submission received: 27 March 2025 / Revised: 24 April 2025 / Accepted: 30 April 2025 / Published: 7 May 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript presents a framework for enhancing extensive databases through the handling of missing data and the detection of anomalous (outlier) values. The authors provide a detailed explanation of their methodology and justify their decisions by referencing prior work. Two research questions are clearly defined, from which the main contributions are derived.

General Evaluation:

The topic addressed is relevant and of interest to the community. The paper demonstrates solid theoretical grounding, and the organization is generally clear. The English language used is of good quality, with only minor errors. However, several issues need to be addressed before the manuscript can be considered for publication.

Major Comments

  1. Highlighting Contributions: While the article articulates its contributions, these are not clearly emphasized in the early sections. I strongly recommend that the authors explicitly state the contributions in the introduction or early in the methodology section, rather than leaving them to be inferred from the concluding sections.

  2. Use of Preprints as References: The manuscript relies on some preprint references, some of which are more than two years old. While preprints can be valuable for reflecting the latest developments, those that remain unpublished after such a period may not have undergone peer-review or may have been rejected. This raises concerns regarding the reliability of the claims derived from them. I recommend replacing or supplementing these older preprints with peer-reviewed sources wherever possible to strengthen the scientific grounding of the work.

  3. Repetition and Redundancy: The manuscript is highly repetitive, with similar ideas being restated multiple times throughout the text. This affects both the readability and the overall length of the paper, which currently appears excessive. A thorough revision aimed at reducing redundancy and increasing conciseness is strongly recommended.

Specific Comments on Presentation and Figures

  1. Figures 4 and 5: These figures appear to convey almost identical information. A table could more effectively summarize the interpolated values, reducing space and improving clarity.

  2. Figure 12: This figure does not provide clear or actionable information, as it relies solely on visual inspection, which can be subjective and imprecise. Consider replacing or supplementing this with a statistical comparison that quantitatively assesses similarity.

  3. Table 3: Table 3 effectively summarizes algorithm performance and could serve as a foundation to streamline the accompanying narrative, which currently duplicates much of the same information.

Technical and Methodological Concerns

  1. Inconsistent Description of Missing Data Handling: In the framework description, the authors mention replacing unknown missing values with the mean of two known values. However, other techniques are later introduced without clarification, leading to some confusion. Please clarify the use of the mean during the first steps of the framework.

  2. Masking Algorithm Limitations: The masking-based algorithm for detecting missing values may fail in scenarios such as [10, NaN, 5, NaN, 5], particularly when using shift operations and AND masks. This could result in incorrect detections. Please clarify whether this case was considered and how such scenarios are handled.

  3. Parameter Specification for Reproducibility: Although various algorithms and windowing techniques for seasonality analysis are discussed, the manuscript lacks detailed parameter settings such as window sizes and tuning values. These are essential for reproducibility and should be explicitly stated. The authors could see this reference as an example: G. Ramirez-Espinosa, P. Chiavassa, E. Giusto, S. Quer, B. Montrucchio and M. Rebaudengo, "Improving Data Quality of Low-Cost Light-Scattering PM Sensors: Toward Automatic Air Quality Monitoring in Urban Environments," IEEE Internet of Things Journal, vol. 11, no. 17, pp. 28409-28420, Sept. 2024, doi: 10.1109/JIOT.2024.3405623.

  4. Ambiguity in the Flow Diagram: In the flow diagram presented in the manuscript, the process of restoring outlier values back to their original state is mentioned, but the mechanism or criteria used to do so are not clearly specified. This step is critical for understanding the overall framework and should be expressed in the flow diagram.

Minor Issues

  • A few notational errors were found, particularly regarding the sign in soil temperature measurements.

  • Some acronyms are introduced without being defined at first mention; please ensure all acronyms are expanded on first use.

Recommendation

The manuscript addresses a relevant and timely topic with a well-structured framework. However, the issues related to clarity, repetition, methodological transparency, and visual representation need to be addressed in a major revision. I encourage the authors to revise the manuscript accordingly, as it has the potential to make a meaningful contribution once improved.

Author Response

 

Response to the Reviewer

We sincerely thank the Reviewer for their thorough review and valuable comments, which have helped us improve the manuscript significantly.

Comment 1: Highlighting Contributions: While the article articulates its contributions, these are not clearly emphasized in the early sections. I strongly recommend that the authors explicitly state the contributions in the introduction or early in the methodology section, rather than leaving them to be inferred from the concluding sections.

Response 1: In response, we have revised the Introduction to clearly and explicitly state the core contributions of the study in a dedicated paragraph, placed before the "Research Question" on page 3, lines 85–94. This addition highlights the methodological innovations and practical value of our work up front. Additionally, we have included a short transition paragraph at the beginning of the Methodology section on page 8, lines 318–321, that reinforces the contribution and links it directly to the proposed implementation. These changes ensure that readers can easily identify the novelty and significance of our approach from the outset.
We have added the following to the Abstract:
The main contributions of this study are as follows:
(1) We develop a dual-phase data quality pipeline for environmental time series, combining statistical and machine learning techniques for outlier detection and imputation.
(2) We propose a sequential strategy for handling zero values, isolated gaps, and prolonged missing sequences that preserves temporal integrity.
(3) We apply curvature-aware interpolation methods, specifically PCHIP and Akima. These methods preserve the natural shape of time series data during imputation. They significantly reduce error, with MSE between 0.002 and 0.004, and R2 values between 0.95 and 0.97, on a 14-million-record dataset from the Royal Commission.
(4) We demonstrate that the proposed workflow is adaptable to other smart city and IoT datasets with minimal adjustments.

We have added the following to the Methodology:
To operationalize the proposed framework, we designed a structured data quality pipeline tailored for environmental time series. The approach combines domain-aware statistical analysis with machine learning techniques to detect and correct anomalies, preserving temporal continuity and minimizing imputation error.

Comment 2: Use of Preprints as References: The manuscript relies on some preprint references, some of which are more than two years old. While preprints can be valuable for reflecting the latest developments, those that remain unpublished after such a period may not have undergone peer-review or may have been rejected. This raises concerns regarding the reliability of the claims derived from them. I recommend replacing or supplementing these older preprints with peer-reviewed sources wherever possible to strengthen the scientific grounding of the work.

Response 2: We agree that citing peer-reviewed sources is essential for ensuring the reliability and scientific credibility of the manuscript.
In response, we conducted a thorough review of all previously cited preprint references. Our actions for each are as follows:
  • Zhou et al. (Weather2K): Now published in the Proceedings of AISTATS 2023. The reference has been updated.
  • Fortin & Liang (2021): Remains a preprint; replaced with Bansal et al. (2021), DeepMVI, published in PVLDB.
  • BayOTIDE (Lee et al., 2023): Now published as Fang et al. (2024) in ICML 2024. Updated accordingly.
  • Ishaq et al. (2023): Remains a preprint and has been removed as it was not critical.
  • Dias et al. (2023): Now published in LNNS (DCAI 2023). The reference has been updated.
As a result, the manuscript now cites only peer-reviewed literature, improving scientific rigor and citation integrity.

Comment 3: Repetition and Redundancy: The manuscript is highly repetitive, with similar ideas being restated multiple times throughout the text. This affects both the readability and the overall length of the paper, which currently appears excessive. A thorough revision aimed at reducing redundancy and increasing conciseness is strongly recommended.

Response 3: We appreciate your observation and understand that excessive restatement can affect readability and length.
We structured sections to be self-contained, but agree that Sections 6.3 and 6.4 repeated content from Section 4. We have revised those sections (page 30, lines 1039–1049 and 1051–1064) to focus on outcomes, trade-offs, and future directions, removing redundant method descriptions.
Revised Section 6.3:
To preserve logical consistency, we prioritized early handling of single missing values via linear interpolation before outlier detection (Section 4.3). This ensures localized gaps do not interfere with anomaly detection or global imputation. While this may introduce bias if neighbors are outliers, we mitigate it with a full outlier filtering phase. Future work could explore spline or polynomial interpolation for non-linear gaps.
Revised Section 6.4:
Outlier detection combined statistical and ML methods (IQR, Z-score, LOF, Isolation Forest) to capture both global and nuanced anomalies (Section 4.4). Treating outliers as missing before imputation preserved timestamps and alignment. No single method sufficed: statistical filters caught obvious errors, ML methods found subtle patterns. Future refinements could include dynamic thresholds adjusted for seasonality or sensor drift.

Comment 4: Figures 4 and 5: These figures appear to convey almost identical information. A table could more effectively summarize the interpolated values, reducing space and improving clarity.

Response 4: The original intention of Figures 4 and 5 was to visually compare the dataset before and after interpolation. Figure 4 highlighted the locations of missing values between known data points, while Figure 5 showed the same dataset after those values were filled using linear interpolation.
We agree that having these figures placed apart may have given the impression of redundancy. In response, we removed Figure 5 and repositioned Figure 4 next to Figure 3 on page 12 to create a clearer side-by-side comparison. We also revised the captions to explain that the missing values shown in Figure 3, including erroneous zeros, were successfully imputed, as shown in Figure 4. While a table could summarize the number of interpolated values, it would not capture the spatial and temporal distribution of missing data. Visual comparison is especially important in time-series analysis, where patterns and continuity are essential. Placing the figures side by side allows the reader to verify that all gaps were properly filled and confirms the effectiveness of the interpolation process. We have added the following Figure 1 to page 12.

Comment 5: Figure 12: This figure does not provide clear or actionable information, as it relies solely on visual inspection, which can be subjective and imprecise. Consider replacing or supplementing this with a statistical comparison that quantitatively assesses similarity.

Response 5: We supplemented the visual comparison with a quantitative Mean Squared Error (MSE) analysis. The MSE changed by only 0.16% after adding noise (page 24, lines 816–823), confirming robustness. We updated the caption for Figure 11 (renumbered) to mention this result:
Revised Caption:
Figure 11. Visual comparison of original and perturbed CO values for sensor S1 over time, as part of the sensitivity analysis. While the curves appear nearly identical, this is supported by a quantitative comparison showing only a 0.16% change in Mean Squared Error (MSE), confirming the stability and robustness of the imputation method.

Comment 6: Table 3: Table 3 effectively summarizes algorithm performance and could serve as a foundation to streamline the accompanying narrative, which currently duplicates much of the same information.

Response 6: Table 3 is a central element of the Results section, and our intent was to present the comparative performance of the imputation methods in a clear, at-a-glance format. To support the table, we included a brief summary in the surrounding text to help orient the reader and highlight key takeaways without duplicating the detailed values already shown. We agree that deeper interpretation belongs in the Discussion section, and we have structured the manuscript accordingly. In response to your comment, we reviewed the Results text to ensure it remains concise and does not repeat information from the table. The current version keeps the focus on presenting the findings in Table 3, while deferring analysis and interpretation to Section 6. We believe this structure improves clarity and avoids redundancy while helping readers follow the progression from results to conclusions.

Comment 7: Inconsistent Description of Missing Data Handling: In the framework description, the authors mention replacing unknown missing values with the mean of two known values. However, other techniques are later introduced without clarification, leading to some confusion. Please clarify the use of the mean during the first steps of the framework.

Response 7: While an earlier draft referred to mean imputation for isolated missing values, our final methodology consistently applies linear interpolation for these cases. Because the time intervals between data points are fixed (one hour), calculating the mean of the two neighboring values is mathematically equivalent to linear interpolation. However, we agree that consistency in terminology is important to avoid confusion. To address this, we revised Section 4 on page 9 (lines 337 and 353) to clearly state that linear interpolation is used for single missing values and removed any outdated references to mean imputation. These clarifications do not affect any results, as the correct method was applied consistently throughout the implementation, evaluation, and validation.

Comment 8: Masking Algorithm Limitations: The masking-based algorithm for detecting missing values may fail in scenarios such as [10, NaN, 5, NaN, 5], particularly when using shift operations and AND masks. This could result in incorrect detections. Please clarify whether this case was considered and how such scenarios are handled.

Response 8: Our masking algorithm explicitly detects single missing values by identifying positions where a NaN is flanked by two non-missing values. This is implemented using the following logic: single_gap = (s.isna() & s.shift(1).notna() & s.shift(-1).notna()). Indices where single_gap is true are marked as single gaps; other NaN values are marked as -1, zeros as 0, and valid values as 1.

Example:
  • Input: [10, NaN, 5, NaN, 5]
  • At index 1: s.shift(1)=10, s.shift(-1)=5 → single_gap[1]=True
  • At index 3: s.shift(1)=5, s.shift(-1)=5 → single_gap[3]=True
Both NaN values are correctly identified as isolated single missing values. By design, any consecutive NaNs or boundary NaNs will fail the neighbor checks and fall outside this condition; they are handled separately.
We have added this explanation and the illustrative example to Section 4.3.1 on page 12, lines 433–455 of the manuscript to clarify this behavior for the reader.

Enhanced explanation (Section 4.3.1):
The first step involves identifying single missing values that occur between two known values in the same sensor time series. To ensure chronological consistency, the dataset is first sorted by time index. We then create a boolean mask using pandas.isnull() to flag all missing values. This mask is a sequence of True (for missing) and False (for present) values (see Figure 3).
  • mask_na = s.isnull() (missing)
  • prev_ok = s.shift(1).notna() (previous present)
  • next_ok = s.shift(-1).notna() (next present)
We then compute:
single_gap = mask_na & prev_ok & next_ok

which flags exactly those NaN positions flanked by valid entries on both sides. For example, in the sequence [10, NaN, 30], the missing value at index 1 is correctly detected. More importantly, the algorithm also handles patterns like [10, NaN, 5, NaN, 5], where multiple isolated NaNs appear. Each NaN is flanked by valid values and is therefore identified as a single missing value.
This method explicitly excludes consecutive or boundary NaNs, which fail at least one neighbor check. Those cases are handled separately using forward‑fill and backward‑fill to maintain continuity when interpolation is not feasible.
The entire process is implemented in Python using Pandas and NumPy, and results are visually verified by overlaying the detection mask on the raw time series. 
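For illustration, a minimal runnable sketch of this detection-and-fill step is given below, using the five-point example from above as a placeholder series; the variable names are illustrative rather than the exact manuscript code, and a real series would first be sorted by its time index as described.

    import numpy as np
    import pandas as pd

    # Placeholder series containing isolated gaps (the example above).
    s = pd.Series([10.0, np.nan, 5.0, np.nan, 5.0])

    # Flag NaNs flanked by valid values on both sides.
    mask_na = s.isna()                # missing positions
    prev_ok = s.shift(1).notna()      # previous value present
    next_ok = s.shift(-1).notna()     # next value present
    single_gap = mask_na & prev_ok & next_ok

    # Fill only the isolated single gaps by linear interpolation; consecutive
    # or boundary NaNs fail the neighbor checks and are handled in later stages.
    filled = s.copy()
    filled[single_gap] = s.interpolate(method="linear")[single_gap]

    print(single_gap.tolist())        # [False, True, False, True, False]
    print(filled.tolist())            # [10.0, 7.5, 5.0, 5.0, 5.0]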

Comment 9: Parameter Specification for Reproducibility: Although various algorithms and windowing techniques for seasonality analysis are discussed, the manuscript lacks detailed parameter settings such as window sizes and tuning values. These are essential for reproducibility and should be explicitly stated. The authors could see this reference as an example: G. Ramirez-Espinosa, P. Chiavassa, E. Giusto, S. Quer, B. Montrucchio and M. Rebaudengo, "Improving Data Quality of Low-Cost Light-Scattering PM Sensors: Toward Automatic Air Quality Monitoring in Urban Environments," IEEE Internet of Things Journal, vol. 11, no. 17, pp. 28409-28420, Sept. 2024, doi: 10.1109/JIOT.2024.3405623.

Response 9: We agree that reproducibility is critical, and we appreciate the reference to Ramirez-Espinosa et al. (2024) as a useful example. In response, we have revised Section 4.4 on page 14 (lines 530-532 and 539-544) and Section 5.4 on page 23 (lines 798-802) to include the specific parameter values used in our experiments.

We have added the following to Section 4.4: For the Local Outlier Factor (LOF), we used n_neighbors = 20 to ensure sensitivity to local fluctuations while minimizing false positives in dense regions. Isolation Forest was configured with n_estimators = 1000 and contamination = 0.02, reflecting the low expected frequency of anomalies. These settings were chosen based on empirical tuning and alignment with prior studies in environmental monitoring.

We have added the following to Section 5.4: For STL decomposition, we used a seasonal window size of 13 and a trend window of 15, and enabled robust fitting to reduce sensitivity to outliers. These parameters were selected based on typical weekly and monthly seasonal cycles in the data and provided the best reconstruction fidelity in our tests.

These additions strengthen the transparency and reproducibility of our methodology and ensure that all algorithms and their settings are now clearly documented.
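As a further illustration of these settings, the sketch below shows how they would be passed to scikit-learn and statsmodels. The series, its index, and period = 7 are placeholder assumptions for a daily-aggregated signal with weekly seasonality (chosen so that the reported trend window of 15 exceeds the period); they are not taken from the manuscript.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import IsolationForest
    from sklearn.neighbors import LocalOutlierFactor
    from statsmodels.tsa.seasonal import STL

    # Placeholder CO series (synthetic, daily aggregation assumed here).
    rng = np.random.default_rng(0)
    co = pd.Series(
        0.4 + 0.1 * rng.standard_normal(730),
        index=pd.date_range("2018-01-01", periods=730, freq="D"),
    )
    X = co.to_numpy().reshape(-1, 1)

    # Outlier detectors with the parameter values reported in Section 4.4.
    lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)   # -1 = outlier
    iso_labels = IsolationForest(
        n_estimators=1000, contamination=0.02, random_state=0
    ).fit_predict(X)                                                  # -1 = outlier

    # STL decomposition with the reported windows (Section 5.4);
    # period=7 (weekly cycle) is an illustrative assumption.
    res = STL(co, period=7, seasonal=13, trend=15, robust=True).fit()
    residual = res.resid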

Comment 10: Ambiguity in the Flow Diagram: In the flow diagram presented in the manuscript, the process of restoring outlier values back to their original state is mentioned, but the mechanism or criteria used to do so are not clearly specified. This step is critical for understanding the overall framework and should be expressed in the flow diagram.

Response 10: We agree that the mechanism for restoring outlier values needed a clearer explanation, and that the flow diagram should reflect this step explicitly. In response, we updated the flowchart (Figure 2) on page 10 to explicitly include the step labeled "Return Outliers", which occurs after imputation. This change improves transparency in the pipeline and helps readers understand how the final dataset preserves valid outliers while maintaining data integrity.

Comment 11: A few notational errors were found, particularly regarding the sign in soil temperature measurements.

Response 11: We thank the reviewer for catching this typographical mistake. Nice catch! The missing negative sign before "1.0 m" on page 6, line 276 has been added in the revised manuscript.

Comment 12: Some acronyms are introduced without being defined at first mention; please ensure all acronyms are expanded on first use.

Response 12: We reviewed the entire manuscript and defined all acronyms at first occurrence, including in the abstract, tables, and figure captions.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors are requested to provide the innovation of this paper in the manuscript.

 

In section 3, Data Description, the authors are requested to use tables to present the details of the dataset, such as the variables involved, the corresponding time period, etc.

 

Why are there no sensors 5 and 7 in Figure 1? What is the difference between these sensors? What are they detecting respectively?

 

In line 504, what does IQR in the formula stand for? Can the authors describe the calculation method in a few words, or give the specific values in a table?

 

In lines 756-779, for sequence missing values, how are the models set up? For example, what are the variables involved in multiple linear regression? Which variables does KNN select as features?

 

For sequence missing values, how does the length of the missing series affect the performance of missing value estimation? How long is the length of missing sequence data allowed by the sequence missing estimation method mentioned in the article?

 

What are the advantages of the method mentioned in the article compared to other deep learning methods based on CNN, LSTM, etc.? Such as,

[1] Khan, M. A. (2024). A Comparative Study on Imputation Techniques: Introducing a Transformer Model for Robust and Efficient Handling of Missing EEG Amplitude Data. Bioengineering, 11(8), 740.

[2] Yu, D., Kong, H., Leung, J. C. H., Chan, P. W., Fong, C., Wang, Y., & Zhang, B. (2024). A 1D Convolutional Neural Network (1D-CNN) Temporal Filter for Atmospheric Variability: Reducing the Sensitivity of Filtering Accuracy to Missing Data Points. Applied Sciences, 14(14), 6289.

[3] Park, J., Müller, J., Arora, B., Faybishenko, B., Pastorello, G., Varadharajan, C., ... & Agarwal, D. (2023). Long-term missing value imputation for time series data using deep neural networks. Neural Computing and Applications, 35(12), 9071-9091.

[4] Ma, J., Cheng, J. C., Jiang, F., Chen, W., Wang, M., & Zhai, C. (2020). A bi-directional missing data imputation scheme based on LSTM and transfer learning for building energy data. Energy and Buildings, 216, 109941.

 

In line 258, a minus sign is missing before 1.0 m.

Author Response

Response to Reviewers

We sincerely thank the Reviewer for their constructive feedback, which has helped us improve the manuscript. Below are our point-by-point responses to each comment, along with the corresponding revisions.

Comment 1: The authors are requested to provide the innovation of this paper in the manuscript. Response 1:

In response, we have clarified the innovation of our work by revising the Introduction to clearly and explicitly state the core contributions of the study in a dedicated paragraph, placed before the "Research Question" on page 3, lines 85–94. This addition highlights the methodological innovations and practical value of our work up front. Additionally, we have included a short transition paragraph at the beginning of the Methodology section on page 8, lines 318–321 that reinforces the contribution and links it directly to the proposed implementation. These changes ensure that readers can easily identify the novelty and significance of our approach from the outset.

We have added the following to the Abstract:

The main contributions of this study are as follows:
(1) We develop a dual-phase data quality pipeline for environmental time series, combining statistical and machine learning techniques for outlier detection and imputation.
(2) We propose a sequential strategy for handling zero values, isolated gaps, and prolonged missing sequences that preserves temporal integrity.
(3) We apply curvature-aware interpolation methods, specifically PCHIP and Akima. These methods preserve the natural shape of time series data during imputation. They significantly reduce error, with MSE between 0.002 and 0.004, and R2 values between 0.95 and 0.97, on a 14-million-record dataset from the Royal Commission.
(4) We demonstrate that the proposed workflow is adaptable to other smart city and IoT datasets with minimal adjustments.

We have added the following to the Methodology:

To operationalize the proposed framework, we designed a structured data quality pipeline tailored for environmental time series. The approach combines domain-aware statistical analysis with machine learning techniques to detect and correct anomalies, preserving temporal continuity and minimizing imputation error.

Comment 2: In section 3, Data Description, the authors are requested to use tables to present the details of the dataset, such as the variables involved, the corresponding time period, etc. Response 2:

As requested, we have added two tables (Table 1 and Table 2) in Section 3.1 on pages 6–7, lines 264–277 (Dataset Overview) to summarize the key characteristics of the dataset. These tables present the types of variables (gas pollutants and meteorological measurements), their units, and the measurement frequency. Additional contextual information, such as the time period covered (60 months) and the number of monitoring locations (10), is provided in the table captions and accompanying text. These updates improve clarity and accessibility.

We have added and updated the following to Section 3.1:

The dataset includes hourly measurements of both gas pollutants and meteorological variables, covering a period of 60 months from Jan 2018 to Dec 2022. Jubail Industrial City, with its blend of industrial and residential zones, serves as an ideal case study for environmental monitoring (see Figure 1). The gas data in Table 1 includes pollutants such as carbon monoxide (CO), hydrogen sulfide (H2S), sulfur dioxide (SO2), nitric oxide (NO), nitrogen dioxide (NO2), oxides of nitrogen (NOx), ammonia (NH3), and non-methane hydrocarbons (NMHC). Advanced metrics, such as NMHC 3-hour rolling average, total hydrocarbons (THC), benzene, ethyl benzene, MP-xylene (m,p-xylene), o-xylene, and toluene, are also captured.

Table 1: Gas Pollutant Variables for all 10 Sensors (Jan 2018–Dec 2022)
Gases Categorized by Pollutants | Units | Collected
CO, H2S | parts per million (ppm) | Hourly
SO2, NO, NO2, NOx, NH3, NMHC, THC, Benzene, Ethyl Benzene, (m,p-xylene), o-Xylene, Toluene | parts per billion (ppb) | Hourly

Meteorological data in Table 2 encompasses atmospheric temperature (at 2 m and 10 m heights), relative humidity (RH), pressure (PRES), and solar radiation (SR). Wind speed and direction are recorded at 10 m, 50 m, and 90 m, while soil temperature is monitored at depths of –0.05 m, –1.0 m, and –2.0 m below ground level. This extensive dataset provides valuable insights into both short-term fluctuations and long-term trends.

Table 2: Meteorological Variables for all 10 Sensors (Jan 2018–Dec 2022)
Category | Variables | Unit | Collected
Air Temperature | 2 m, 10 m | °C | Hourly
Soil Temperature | –0.05 m, –1.0 m, –2.0 m | °C | Hourly
Relative Humidity | RH | % | Hourly
Atmospheric Pressure | PRES | hPa | Hourly
Solar Radiation | SR | W/m² | Hourly
Wind (10 m, 50 m, 90 m) | Speed, Direction | m/s, ° | Hourly

Comment 3: Why are there no sensors 5 and 7 in Figure 1? What is the difference between these sensors? What are they detecting respectively? Response 3:

The numbering (1, 2, 3, 4, 6, 8, 9, 10, 11, 12) reflects the original identifiers provided in the dataset by the source agency. We received data from ten monitoring locations, but sensors labeled 5 and 7 were not included in the dataset available for this study. As a result, they do not appear in the figure or analysis.

All the listed sensors contributed gas and meteorological measurements as described in Section 3.1. While there may be minor differences in the available variables across some locations, our core analysis focuses on CO data, which is consistently available across all included sensors. To clarify this for other readers, we have also updated the caption of Figure 1 on page 8 to explain the missing sensor numbers.

Comment 4: In line 504, what does IQR in the formula stand for? Can authors describe the calculation method in some words, or give the specific values in a table? Response 4:

Thank you for pointing this out. In the original manuscript, the acronym “IQR” (Interquartile Range) was introduced in the Introduction (page 2, lines 79–80), but it was not explicitly defined at the point of use in line 504. We have corrected this in the revised manuscript by adding a clear definition and explanation in Section 4.4 (on page 14, lines 520–533).

To clarify, the IQR is calculated as the difference between the third quartile (Q3) and the first quartile (Q1). It is used in our outlier detection method to set bounds: any value below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR is flagged as an outlier.

We have updated the following to Section 4.4:

The IQR is defined as Q3 – Q1, where Q1 and Q3 are the 25th and 75th percentiles of the data. A reading is flagged as an outlier if it lies below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR. For example, for CO at sensor S1 we have Q1 = 0.282 ppm and Q3 = 0.617 ppm, so IQR = 0.335 ppm. This gives thresholds of 0.282 – 1.5 × 0.335 ≈ –0.22 ppm (truncated to 0 ppm) and 0.617 + 1.5 × 0.335 = 1.1195 ppm; any CO value outside [0, 1.1195] ppm is therefore flagged as an outlier.
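As an illustration only (the data values below are synthetic placeholders, not sensor S1 readings), this rule can be sketched as:

    import numpy as np

    def iqr_bounds(values, k=1.5):
        # Return (lower, upper) fences; negative lower bounds are truncated to 0
        # because gas concentrations cannot be negative.
        q1, q3 = np.percentile(values, [25, 75])
        iqr = q3 - q1
        return max(q1 - k * iqr, 0.0), q3 + k * iqr

    # Illustrative CO readings (ppm); for sensor S1 the manuscript reports
    # Q1 = 0.282 and Q3 = 0.617, giving fences of 0 and about 1.12 ppm.
    co = np.array([0.30, 0.45, 0.62, 0.28, 1.80, 0.50, 0.41])
    lower, upper = iqr_bounds(co)
    outliers = (co < lower) | (co > upper)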

Comment 5: In lines 756–779, for sequence missing values, how are the models set up? For example, what are the variables involved in multiple linear regression? Which variables does KNN select as features? Response 5:

We have clarified in the manuscript that both regression-based imputations for CO rely only on its immediate temporal neighbors (COt–1 and COt+1) as predictors. This simple setup avoids overfitting while capturing local trends effectively. Similarly, the KNN imputer is applied univariately to the CO series, using k = 5 and Euclidean distance computed from observed CO values. Each missing point is estimated as the average of its five nearest non-missing neighbors in value space.

Both regression-based imputations estimate a missing CO concentration at time t by regressing on its immediate temporal neighbors. Concretely, we fit:
COt = β0 + β1·COt–1 + β2·COt+1 + εt;
where COt–1 and COt+1 are the known CO measurements at the preceding and following time steps.

For the KNN imputer, we use a univariate approach: each non-missing time point s is represented by its scalar COs, distances are computed as d(s,t) = |COs – COt|, with k = 5, and the imputed value is:
COt = (1/k) ∑i∈Nₖ(t) COi;
where Nₖ(t) denotes the set of the k nearest neighbors of time t in CO space.
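As a sketch of the regression setup described above (the helper function and lag/lead feature construction are illustrative assumptions, not the authors' exact code), the neighbor-based model could be written as:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def impute_with_neighbors(co: pd.Series) -> pd.Series:
        # Fit CO_t ~ CO_{t-1} + CO_{t+1} on rows where all three values are known,
        # then predict only for gaps whose immediate neighbors are observed.
        frame = pd.DataFrame({"prev": co.shift(1), "next": co.shift(-1), "target": co})
        train = frame.dropna()
        model = LinearRegression().fit(train[["prev", "next"]], train["target"])

        fillable = frame["target"].isna() & frame["prev"].notna() & frame["next"].notna()
        out = co.copy()
        out[fillable] = model.predict(frame.loc[fillable, ["prev", "next"]])
        return out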

Comment 6: For sequence missing values, how does the length of the missing series affect the performance of missing value estimation? How long is the length of missing sequence data allowed by the sequence missing estimation method mentioned in the article? Response 6:

Thank you for raising the question regarding the impact of gap length on imputation performance. In our dataset, sequential CO gaps extended up to six days, although gaps longer than 24 hours represented less than 5 % of all missing cases. Over this full range, the interpolation methods PCHIP and Akima maintained strong performance, with MSE values between 0.002 and 0.004 and R2 values approaching 0.95. This indicates that their effectiveness was largely preserved even in the presence of longer gaps, and that such gaps had only a minor influence on overall accuracy.
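For context, a generic sketch of curvature-aware gap filling with pandas (which delegates the pchip and akima methods to SciPy) is shown below; the series and the gap are synthetic placeholders, not the Royal Commission data.

    import numpy as np
    import pandas as pd

    # Placeholder hourly CO series with a simulated multi-hour gap.
    co = pd.Series(0.4 + 0.1 * np.sin(np.arange(72) / 6.0))
    co.iloc[30:42] = np.nan                      # a 12-hour gap

    # Shape-preserving interpolation options evaluated in the paper.
    co_pchip = co.interpolate(method="pchip")    # piecewise cubic Hermite (PCHIP)
    co_akima = co.interpolate(method="akima")    # Akima spline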

We have added the following to Section 6.7 (Limitations):

Our analysis did not explicitly assess how imputation performance varies across different gap lengths, particularly for extended sequences, which may affect generalizability to datasets with more frequent or prolonged missing intervals.

We have added the following to Section 7.2 (Future Work):

A valuable direction for future work is to more systematically evaluate how imputation performance varies with the length of missing data sequences, particularly for longer gaps beyond those commonly observed in this dataset.

Comment 7: What are the advantages of the method mentioned in the article compared to other deep learning methods based on CNN, LSTM, etc.? Such as, 

[1] Khan, M. A. (2024). A Comparative Study on Imputation Techniques: Introducing a Transformer Model for Robust and Efficient Handling of Missing EEG Amplitude Data. Bioengineering, 11(8), 740.

[2] Yu, D., Kong, H., Leung, J. C. H., Chan, P. W., Fong, C., Wang, Y., & Zhang, B. (2024). A 1D Convolutional Neural Network (1D-CNN) Temporal Filter for Atmospheric Variability: Reducing the Sensitivity of Filtering Accuracy to Missing Data Points. Applied Sciences, 14(14), 6289.

[3] Park, J., Müller, J., Arora, B., Faybishenko, B., Pastorello, G., Varadharajan, C., ... & Agarwal, D. (2023). Long-term missing value imputation for time series data using deep neural networks. Neural Computing and Applications, 35(12), 9071-9091.

[4] Ma, J., Cheng, J. C., Jiang, F., Chen, W., Wang, M., & Zhai, C. (2020). A bi-directional missing data imputation scheme based on LSTM and transfer learning for building energy data. Energy and Buildings, 216, 109941.

Response 7:

These models have shown impressive capabilities in modeling complex temporal dependencies, particularly in high-dimensional or multivariate contexts. Our study, however, focuses on univariate, moderate-resolution environmental sensor data intended for real-time preprocessing in smart city deployments.

We evaluated a range of imputation strategies—including statistical methods (IQR, Z-Score), machine learning–based outlier detection (LOF, Isolation Forest), and interpolation techniques (PCHIP, Akima, Kalman Smoothing)—and selected them based on three considerations:

  1. Accuracy for this dataset: MSE = 0.002–0.004, R2 = 0.95–0.97 on a 10 % hold-out set.
  2. Efficiency and deployment: Lightweight, training-free, and suitable for real-time edge-device implementation.
  3. Interpretability and transparency: Modular steps that domain experts can trace, critical for environmental monitoring.

While the cited deep learning methods are promising, our pipeline met our accuracy needs with far lower complexity and resource requirements. We have noted benchmarking against deep learning approaches in Section 7.2, updated the Previous Work section, and expanded the Discussion accordingly.

Comment 8: In line 258, a – is missing before 1.0 m. Response 8:

We appreciate the reviewer’s attention to detail. “Nice catch!” The missing negative sign before “1.0 m” on page 6, line 276 has been added in the revised manuscript.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Dear Authors,

After reviewing your responses to my previous comments, I find that they have been clearly and satisfactorily addressed. Your replies were clear and concise, and I have no objections regarding the clarifications and corrections made to the manuscript.

The only observation I would like to add is that, in my opinion, the manuscript still appears somewhat lengthy. I believe that a more concise version, ideally under 30 pages, could have been sufficient. However, this is merely a personal suggestion and does not constitute a request for further revision.
