Guidance for Interactive Visual Analysis in Multivariate Time Series Preprocessing
Abstract
1. Introduction
1.1. Problem
1.2. Objectives
- We designed a modular architecture that incorporates guidance levels, algorithm recommendations, and explainability mechanisms to bridge the knowledge gap and support decision-making in multivariate time series preprocessing.
- We designed a systematic workflow by identifying and organizing tasks, subtasks, and algorithms for multivariate time series preprocessing.
- We developed and implemented an interactive visual analysis tool that embodies the proposed architecture and assists users in preprocessing multivariate time series.
- We evaluated the proposed approach through case studies and user studies, measuring criteria of usability, explainability, and quality of results.
1.3. Contributions
- The preprocessing phase is incorporated as part of the interactive visual analysis process for multivariate time series.
- A proposal of a guidance system for interactive visual analytics, focused explicitly on preprocessing multivariate time series, is given.
- An introduction of automatic recommendations for tasks, subtasks, and algorithms as a key element of the guidance system is provided.
- The integration of visual explainability within the guidance process is achieved, extending the concept of explainability (widely used in AI/XAI) to the novel notion of explainable guidance (XG).
- This study contributes to the scientific community through a set of reproducible strategies for integrating guidance into multivariate time series analysis, including workflow templates and validation criteria.
1.4. Paper Structure
2. Related Work
2.1. Works on Visual Analysis with Guides
2.2. Works on Preprocessing
2.3. Works on Null Value Imputation Techniques
2.4. Works on Explainability
2.5. Works on Guidance and Explainability Evaluation
3. Proposal
3.1. Proposal Architecture
3.2. Component of Guidance for Interactive Visual Analysis in the Preprocessing
3.2.1. Configuration Module
- Definition of tasks and subtasks: We defined primary tasks (Table 4) and subtasks (Table 5) according to the state-of-the-art review conducted in Section 2.2. They include tasks for trend, seasonality, and cyclicality analysis.
- Definition of algorithms: Based on prior experience and an analysis of the state of the art, a selection of algorithms was made for each task/subtask, focusing on publications that review algorithms:
  - Identification of null value imputation techniques: The proposal is based on imputing missing data rather than eliminating it [26,27,28,29]. From the state-of-the-art analysis conducted in Section 2.3 (Table 3), we selected all techniques cited by two or more references. Clustering can distort sequential patterns such as trends, seasonality, or temporal interrelationships between variables, and SVMs are not designed to model temporal dependencies or relationships between variables; moreover, for multivariate regression tasks, they require a complex architecture. For these reasons, we ruled out SVM and clustering. We also included Legendre Polynomials, a little-referenced technique that yielded promising results in tests with our time series.
  - Techniques for normalization: Based on the most referenced scientific articles on normalization techniques, the most robust and widely used techniques for multivariate time series were identified [35,36]. These techniques allow data preparation when the variables present different magnitudes or variations, facilitating a fair comparison between them without requiring a nonlinear transformation or altering the temporal pattern of the data.
  - Techniques for dimensionality reduction: According to the state of the art, the two most representative dimensionality reduction techniques were adopted: principal component analysis (PCA) and factor analysis (FA). PCA imposes no requirement on the data distribution, is widely used when variables are highly correlated, and is computationally efficient. FA is beneficial when one wants to find latent correlations [38,39].
  Finally, a summary of all the algorithms used for the workflow tasks/subtasks is presented in Table 6.
- Definition of visual interfaces: The design of the visual interfaces was oriented not toward graphic sophistication but toward clarity and expressiveness. For the main workflow visualization, a dynamic graph with a hierarchical structure was adopted, as it best expressed structured navigation. The selection of visual interfaces is shown in Table 7.
3.2.2. Workflow Module
3.2.3. Recommendation Module
- Recommendation of tasks and subtasks (a minimal sketch of this voting rule follows the list below).
  - Multiple validation algorithms are run (there may be two, three, or more).
  - Each algorithm returns a Boolean value, TRUE or FALSE, depending on specific conditions that justify the need to implement a task or subtask.
  - If at least 50% of the algorithms return TRUE, the system suggests implementing the task or subtask.
  - If fewer than 50% return TRUE, the task or subtask is not recommended.
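To make this rule concrete, the following minimal Python sketch (our own illustration, not the tool's actual code) aggregates the Boolean results of an arbitrary set of validation functions:

```python
# Minimal sketch of the >=50% voting rule; the validation functions are
# hypothetical stand-ins for the per-task checks detailed in (A)-(E) below.
from typing import Callable, List
import pandas as pd

def recommend(df: pd.DataFrame, checks: List[Callable[[pd.DataFrame], bool]]) -> bool:
    """Recommend a task/subtask if at least 50% of the validators return True."""
    votes = [check(df) for check in checks]
    return sum(votes) >= 0.5 * len(votes)
```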
- Below are the details of the recommendation for each task or subtask; there is no single criterion, as each task has its own particularities:
- (A) Cleaning: The cleaning task recommendation is based on a combination of different statistical techniques to identify noise and outliers (a hedged sketch follows this list):
  - Calculates noise for each variable based on the standard deviation: values outside $[\mu - 3\sigma, \mu + 3\sigma]$ are flagged as noise.
  - Calculates noise for each variable based on the coefficient of variation ($CV = \sigma / \mu$), flagging high variability when the CV exceeds a predefined threshold.
  - Detects outliers for each variable by calculating the IQR ($IQR = Q3 - Q1$); values outside the limits $[Q1 - 1.5 \cdot IQR, Q3 + 1.5 \cdot IQR]$ are considered outliers.
  - Detects outliers for each variable with a Z-score, where values with $|Z| \geq 3.5$ are considered outliers.
  - Applies the Grubbs test to detect an extreme value (minimum or maximum) for each variable, where the statistic $G = \max_i |x_i - \bar{x}| / s$ is compared against its critical value.
  - Each test returns a True or False value.
  - The tests are tallied: if at least 50% of them detect problems in the data (returning True), a cleaning task is recommended.
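These checks can be sketched as follows; the 3σ band, the CV threshold of 1.0, and α = 0.05 for the Grubbs test are our assumptions where the paper does not state exact values:

```python
import numpy as np
import pandas as pd
from scipy import stats

def cleaning_votes(x: np.ndarray) -> list:
    """Run the five cleaning checks on one variable; each returns True/False."""
    x = x[~np.isnan(x)]
    mu, sigma = x.mean(), x.std()
    # 3-sigma band: values outside [mu - 3*sigma, mu + 3*sigma] count as noise.
    noise_sigma = bool((np.abs(x - mu) > 3 * sigma).any())
    # Coefficient of variation; the 1.0 threshold is an assumption.
    noise_cv = (sigma / abs(mu) if mu != 0 else np.inf) > 1.0
    # IQR fences.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    out_iqr = bool(((x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)).any())
    # Z-score with the 3.5 threshold used in the paper.
    out_z = bool((np.abs((x - mu) / sigma) >= 3.5).any())
    # Grubbs test for a single extreme value (two-sided, alpha = 0.05).
    n = len(x)
    g = np.max(np.abs(x - mu)) / sigma
    t = stats.t.ppf(1 - 0.05 / (2 * n), n - 2)
    out_grubbs = g > (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return [noise_sigma, noise_cv, out_iqr, out_z, out_grubbs]

def recommend_cleaning(df: pd.DataFrame) -> bool:
    """Recommend cleaning if at least 50% of all per-variable tests fire."""
    votes = [v for col in df.columns for v in cleaning_votes(df[col].to_numpy(float))]
    return sum(votes) >= 0.5 * len(votes)
```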
- (B) Outliers: Determines whether there are outliers in the data by implementing three primary methods: Z-score, IQR (Interquartile Range), and the Grubbs test.
  - Calculates the Z-score for each variable (iterating by column), using the default threshold of 3.5. Values with $|Z| \geq 3.5$ are considered outliers. The result is the number of outliers and a Boolean value of True indicating their existence.
  - Calculates the IQR for each variable ($IQR = Q3 - Q1$) and its limits, $[Q1 - 1.5 \cdot IQR, Q3 + 1.5 \cdot IQR]$; values outside the limits are considered outliers. The result is returned as the quantity and the Boolean value.
  - The Grubbs test detects a single extreme outlier, either the minimum or the maximum, per column. The result is returned as a Boolean value.
  - The statistical tests also include the standard deviation to detect noise and the coefficient of variation to identify variability.
  - Additionally, noise detection (extreme peaks or high variability) is performed: for each variable, the mean ($\mu$) and standard deviation ($\sigma$) of the series are calculated, and values outside $[\mu - 3\sigma, \mu + 3\sigma]$ are identified as outliers. If any exist, the column is marked as noisy with the value True.
  - Finally, the Boolean values equal to True are counted; if the result is ≥50%, it is concluded that outliers have been detected, and the task is recommended.
- (C) Normalization: This is applied based on two main criteria: seasonality detection and distribution evaluation (a hedged sketch follows this list).
  - Seasonality detection: The presence of seasonal patterns that may influence the distribution of the series is assessed. To do this, the MinMaxScaler is first applied to avoid scaling biases. A seasonal decomposition is then performed to separate the series into its trend, seasonality, and residual components, and the variances of the seasonal and residual components are compared. If the variance of the seasonal component is greater in at least one variable, it is concluded that seasonality impacts the distribution of the series, so a normalization process is recommended.
  - Distribution evaluation: This determines whether the data follow a known distribution (normal or log-normal) after scaling. The data is scaled using MinMaxScaler, the standard deviation is calculated for each variable, and it is compared with the standard deviation expected from a normal distribution. If the difference is slight, the series follows a normal distribution; otherwise, a log-normal distribution is fitted to verify whether the series belongs to a family of known distributions. If the series does not follow a known distribution, it is marked as True.
  - Final decision: The seasonality and distribution detection functions are run in parallel, and their results are combined to decide whether normalization is recommended. If 50% or more of the tests return True, this indicates the presence of seasonality or atypical data, and normalization is recommended; otherwise, it is not necessary. A positive recommendation activates all four available scaling types, leaving the final choice up to the user.
  - If no additional data quality issues are detected but the series contains both negative and positive values, only MinMax normalization is enabled, as this option may be helpful to the user after analyzing the signal.
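A compact sketch of both criteria follows, assuming a monthly seasonal period and substituting a Shapiro–Wilk test for the paper's standard-deviation comparison:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import MinMaxScaler
from statsmodels.tsa.seasonal import seasonal_decompose

def seasonality_detected(df: pd.DataFrame, period: int = 12) -> bool:
    """True if, for at least one variable, the seasonal component's variance
    exceeds the residual variance after MinMax scaling and decomposition."""
    scaled = pd.DataFrame(MinMaxScaler().fit_transform(df),
                          columns=df.columns, index=df.index)
    for col in scaled.columns:
        parts = seasonal_decompose(scaled[col].dropna(), period=period)
        if np.nanvar(parts.seasonal) > np.nanvar(parts.resid):
            return True
    return False

def unknown_distribution(df: pd.DataFrame, alpha: float = 0.05) -> bool:
    """True if some scaled variable fits neither a normal nor a log-normal
    distribution (Shapiro-Wilk used here as a stand-in for the std comparison)."""
    scaled = MinMaxScaler().fit_transform(df.dropna())
    for j in range(scaled.shape[1]):
        x = scaled[:, j]
        normal = stats.shapiro(x[:5000]).pvalue > alpha
        positive = x[x > 0]
        lognormal = len(positive) == len(x) and \
            stats.shapiro(np.log(positive)[:5000]).pvalue > alpha
        if not (normal or lognormal):
            return True
    return False

def recommend_normalization(df: pd.DataFrame) -> bool:
    votes = [seasonality_detected(df), unknown_distribution(df)]
    return sum(votes) >= 0.5 * len(votes)
```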
- (D) Dimensionality reduction: This is carried out as follows (a hedged sketch follows this list).
  - The correlation between variables is assessed using different methods (Pearson, Spearman, and Kendall) on the scaled variables. If many variables are highly correlated (above a defined threshold), this indicates redundancies in the data that could be reduced.
  - Multicollinearity is detected: the Variance Inflation Factor (VIF) is calculated for each variable. If some variables have a high VIF (greater than 10), this suggests high multicollinearity, which supports the need to reduce dimensionality.
  - Dimensionality reduction tests are performed, including principal component analysis (PCA), which evaluates the cumulative variance of the components: if the cumulative variance threshold (0.8) can be reached without using all components, dimensionality reduction is possible. Factor analysis (FA) is also performed, where the magnitude of the factor loadings is analyzed and variables with low loadings (below a threshold of 0.4) are identified for potential elimination.
  - If the majority of these tests (at least 50%) indicate redundancies or high correlation, it is concluded that dimensionality reduction is necessary.
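The quantitative checks can be sketched as follows (the FA-loadings test is omitted for brevity; the thresholds follow the text):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor

def recommend_dimensionality_reduction(df: pd.DataFrame,
                                       corr_thr: float = 0.9,
                                       vif_thr: float = 10.0,
                                       var_thr: float = 0.8) -> bool:
    X = df.dropna()
    # 1) Highly correlated pairs (Pearson shown; Spearman/Kendall analogous).
    corr = X.corr(method="pearson").abs().to_numpy()
    upper = corr[np.triu_indices_from(corr, k=1)]
    high_corr = bool((upper > corr_thr).any())
    # 2) Multicollinearity via the Variance Inflation Factor.
    vifs = [variance_inflation_factor(X.to_numpy(), i) for i in range(X.shape[1])]
    high_vif = any(v > vif_thr for v in vifs)
    # 3) PCA: is 80% of the variance reachable without using all components?
    cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    reducible = bool(np.argmax(cum >= var_thr) + 1 < X.shape[1])
    votes = [high_corr, high_vif, reducible]
    return sum(votes) >= 0.5 * len(votes)
```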
- (E) Transformation: The recommendation is based on the assessment of stationarity and persistence. To verify whether the time series is non-stationary, the Augmented Dickey–Fuller (ADF) and Kwiatkowski–Phillips–Schmidt–Shin (KPSS) statistical tests are applied. If either of these tests indicates non-stationarity, the series is considered a possible candidate for transformation. Additionally, to assess persistence, the Hurst exponent ($H$) is calculated, which measures whether the series exhibits persistent or anti-persistent behavior. If $H > 0.5$, the series exhibits a long-term trend and may require transformation to improve its behavior. This procedure enables us to determine whether a time series requires a transformation to become stationary or to enhance its behavior before applying other analyses.
- Recommendation of algorithms for null values. The defined algorithms (Table 6) are executed and evaluated using the Weighted Mean Absolute Percentage Error (WMAPE) metric to identify the most appropriate algorithm and prioritize the recommendation for execution (Figure A1 in Appendix A). The WMAPE metric is used because it is more robust than other existing metrics [43,44]. The best model is selected as follows. The null-filling procedure uses both non-predictive and predictive models. In all cases, the dataset is split into three parts: a training set, a validation set, and the null set to be imputed. For predictive models, the model is trained on the training set and used to predict the validation set; the metric is obtained by evaluating the predicted values against the actual values of the validation set. The model with the best score is then used to predict the null dataset. A detailed explanation follows.
- (a) Given a series, an evaluation of the null values is performed (Table 9).
- (b) The null values are then separated by generating two datasets: one containing the complete data and another with the null data, ensuring that the original indexes are respected (Table 10).
- (c) Three datasets are then generated: of the complete data, 80% is used to train the model and 20% for validation; the metric is obtained by comparing the predicted values with the actual validation data. The third, null dataset is completed with the model that scores best according to the WMAPE metric (Table 11).
- (d) Finally, the entire series is assembled by completing the null data.
For non-predictive models, the procedure is similar; however, instead of training a model with 80% of the complete data and validating it with the remaining 20%, the mean and median are calculated from the training portion and compared with the actual values of the 20% validation portion to obtain the metric.
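A minimal sketch of the WMAPE metric and the 80/20 scoring loop follows; regressing on the temporal position is an illustrative simplification of the tool's predictive setup:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def wmape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Weighted Mean Absolute Percentage Error: sum(|y - yhat|) / sum(|y|)."""
    return float(np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)))

def score_imputers(series: pd.Series) -> dict:
    """Score candidate imputers on an 80/20 split of the observed values;
    the algorithm with the lowest WMAPE would be recommended."""
    pos = np.flatnonzero(series.notna().to_numpy())   # observed positions
    vals = series.to_numpy(float)[pos]
    split = int(0.8 * len(pos))                       # 80% train / 20% validation
    y_valid = vals[split:]
    scores = {}
    # Non-predictive baselines: constant mean / median from the training part.
    scores["mean"] = wmape(y_valid, np.full_like(y_valid, vals[:split].mean()))
    scores["median"] = wmape(y_valid, np.full_like(y_valid, np.median(vals[:split])))
    # Predictive model: regress the value on its temporal position (illustrative).
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(pos[:split].reshape(-1, 1), vals[:split])
    scores["random_forest"] = wmape(y_valid, rf.predict(pos[split:].reshape(-1, 1)))
    return scores
```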
- Implementation of algorithm explainability for null values. In order to provide transparency and facilitate understanding of the recommendation process, an explanatory visualization is implemented (a two-way matrix relating algorithms and variables, using the WMAPE metric as the key performance indicator), which allows the user to analyze why specific algorithms are suggested over others for handling null values (Figure 4). A sketch of this matrix rendering follows the list below.
- (a) Y-axis (rows): algorithms evaluated for imputation. X-axis (columns): time series variables with null values. Matrix cells: the WMAPE value obtained by each algorithm when imputing a given variable.
- (b) Cells highlighted in red indicate the lowest WMAPE per variable (best performance in terms of relative error). The visual highlighting makes it easy to immediately identify which algorithms are most effective. The system counts the number of times each algorithm achieved the best performance (minimum WMAPE).
- (c) The algorithms with the highest number of red cells are prioritized and recommended.
- (d) In the dynamic hierarchical graph, prioritization is reflected with a more intense color in the nodes corresponding to the recommended algorithms (Figure A1 in Appendix A).
- (e) When interacting with the visualization (by hovering over a node), additional contextual information is displayed: the WMAPE definition and formula, a description of the recommended algorithm, and the extreme WMAPE values per variable.
- Recommendation of algorithms for outlier detection. The selection of the appropriate algorithm is based on a sequential comparison strategy between two complementary approaches: Z-score (parametric) and IQR (non-parametric) (Figure A2 in Appendix B). A direct sketch of this strategy follows the list.
- (a) The Z-score is applied, and the detected outliers are replaced with the median of the series.
- (b) IQR is applied to the modified data to detect residual outliers. The system recommends the Z-score if there are no residuals; otherwise, it recommends the IQR.
- (c) The Z-score is therefore optimal when the data is normally distributed; if outliers persist after its application, IQR is recommended.
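```python
import numpy as np

def recommend_outlier_algorithm(x: np.ndarray, z_thr: float = 3.5) -> str:
    """Apply the Z-score first and replace its outliers with the median; if the
    IQR still finds residual outliers afterwards, recommend IQR instead."""
    x = x[~np.isnan(x)].astype(float)
    z = np.abs((x - x.mean()) / x.std())
    treated = np.where(z >= z_thr, np.median(x), x)
    q1, q3 = np.percentile(treated, [25, 75])
    iqr = q3 - q1
    residual = ((treated < q1 - 1.5 * iqr) | (treated > q3 + 1.5 * iqr)).any()
    return "IQR" if residual else "Z-score"
```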
- Implementing explainability of algorithms for outliers. In order to make the algorithm selection transparent, a comparative visualization is implemented based on a bar chart (Figure 5), where each bar represents an algorithm (Z-score or IQR) and the height indicates the percentage of outliers successfully treated by each method. In the case of a tie (both methods treat 100% of the outliers), the Z-score is preferred as the standard method, consistent with the assumption of normality and its computational efficiency [45].
- Recommendation of algorithms for dimensionality reduction. Various heuristic tests are applied to the data, and the results are combined to determine whether dimensionality reduction should be recommended (Figure A3 in Appendix C). A hedged sketch of the decision rule follows this list.
- (a) The Kaiser–Meyer–Olkin (KMO) index is calculated on the scaled data. This index assesses whether the dataset presents sufficient partial correlation to allow the application of dimensionality reduction techniques.
- (b) If the KMO value is greater than or equal to 0.7, factor analysis (FA) is recommended, as the data are suitable for identifying latent factors.
- (c) If the KMO value is less than 0.7, principal component analysis (PCA) is used due to its robustness in scenarios with low correlation between variables.
- (d) The final recommendation is made by considering the consensus among the tests and selecting the method that best preserves the temporal structure of the series (trend and seasonality).
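A minimal sketch of the decision rule, assuming the third-party factor_analyzer package for the KMO computation:

```python
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo

def recommend_reduction_algorithm(df: pd.DataFrame, kmo_thr: float = 0.7) -> str:
    """Return 'FA' when the data shows enough partial correlation for latent
    factors (overall KMO >= 0.7), and 'PCA' otherwise."""
    _, kmo_total = calculate_kmo(df.dropna())
    return "FA" if kmo_total >= kmo_thr else "PCA"
```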
- Implementation of explainability of algorithms for dimensionality reduction. To explain the algorithm recommendation, the calculated value of the KMO index is visualized, accompanied by a correlation matrix that allows the user to visually explore the degree of correlation between variables (Figure 6).
- (a) If KMO ≥ 0.7, FA is recommended.
- (b) If KMO < 0.7, PCA is recommended.
- (c) The explanation is complemented by textual information in the black- and yellow-background boxes.
- Recommendation of normalization algorithms. The recommendation is based on a structured approach that seeks to preserve the temporal structure of multivariate series. This approach consists of two stages: (1) analysis of the need for normalization and (2) selection of the most appropriate scaling method. A hedged sketch of stage (2) follows this list.
- (a) First, the data is checked for the need for normalization by evaluating two key aspects: the presence of seasonality and the fit to a known distribution. If either criterion is positive, the normalization recommendation process is activated.
- (b) To identify the most appropriate scaler, four versions of the same dataset are created using the MaxAbsScaler, MinMaxScaler, StandardScaler, and RobustScaler methods. Seasonal-Trend decomposition using LOESS (STL) is then applied to each scaled version and to the original data, extracting the trend and seasonality components.
- (c) To evaluate which scaling method best preserves the temporal structure, the correlation between the trend and seasonality components of the original version and each scaled version is calculated. For each scaler, an average of these correlations is obtained across variables, representing its ability to preserve temporal dynamics.
- (d) The algorithm with the highest average correlation is considered the most suitable for the analyzed dataset.
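A sketch of this scaler-selection stage, assuming nulls have already been imputed and a seasonal period of 12:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler
from statsmodels.tsa.seasonal import STL

def recommend_scaler(df: pd.DataFrame, period: int = 12) -> str:
    """Pick the scaler whose STL trend/seasonal components correlate best,
    on average, with those of the original (complete, null-free) data."""
    scalers = {"MaxAbs": MaxAbsScaler(), "MinMax": MinMaxScaler(),
               "Standard": StandardScaler(), "Robust": RobustScaler()}

    def components(s: pd.Series):
        res = STL(s, period=period).fit()
        return res.trend, res.seasonal

    scores = {}
    for name, scaler in scalers.items():
        scaled = pd.DataFrame(scaler.fit_transform(df),
                              columns=df.columns, index=df.index)
        corrs = []
        for col in df.columns:
            t0, s0 = components(df[col])
            t1, s1 = components(scaled[col])
            corrs.append(np.corrcoef(t0, t1)[0, 1])   # trend preservation
            corrs.append(np.corrcoef(s0, s1)[0, 1])   # seasonality preservation
        scores[name] = float(np.nanmean(corrs))
    return max(scores, key=scores.get)  # highest average correlation wins
```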
- Implementing explainability of algorithms for normalization. The recommendation is accompanied by a bar chart, where each bar represents a scaling method and its average correlation value. This visualization makes it easy to compare the relative performance of the algorithms. The system reinforces this explanation with pop-up textual information boxes when the user interacts with each visual element, making the selection criteria easier to understand.
- Recommendation of transformation algorithms. The recommendation is based on three statistical tests applied to each series (a hedged sketch follows this list):
- (a) Augmented Dickey–Fuller (ADF): Detects the presence of non-stationarity.
- (b) KPSS: Evaluates stationarity in levels.
- (c) Hurst exponent (H): Measures the degree of long-term persistence of the series, which supports the transformation.
- (d) Each test returns a Boolean value. If at least 50% of the tests return TRUE, the system recommends applying a transformation.
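A sketch of the three tests and the voting rule; the lag-based Hurst estimator shown here is one common choice, not necessarily the tool's:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

def hurst_exponent(x: np.ndarray, max_lag: int = 20) -> float:
    """Diffusion-based estimate: slope of log std of lagged differences."""
    lags = np.arange(2, max_lag)
    tau = [np.std(x[lag:] - x[:-lag]) for lag in lags]
    return float(np.polyfit(np.log(lags), np.log(tau), 1)[0])

def recommend_transformation(x: np.ndarray, alpha: float = 0.05) -> bool:
    x = x[~np.isnan(x)]
    votes = [
        adfuller(x)[1] > alpha,              # ADF: cannot reject unit root -> non-stationary
        kpss(x, regression="c")[1] < alpha,  # KPSS: rejects level stationarity
        hurst_exponent(x) > 0.5,             # persistent long-term behavior
    ]
    return sum(votes) >= 0.5 * len(votes)
```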
3.2.4. Guidance Level Module
- Orientation: The main objective is to build or maintain the user's mental map. Our proposal implements the following:
  - History of actions: In the graph, the system uses the color blue on nodes and edges to indicate the path of executed actions, while stacked windows on the sides display the record of completed tasks.
  - Assigning views: The system implements statistical functions that allow tracking changes in the data during the preprocessing stage and analyzing the behavior of the series.
  - Relationships between datasets: Within the statistical functions, an option allows the user to examine the relationships between the variables in the series.
  - Highlighting of recommended actions: The system uses visual cues in the graph to represent the actions taken, those about to be executed, and the corresponding indications.
  - Informative text boxes: Pop-up messages.
  - Customization of geometric shapes: The graph uses nodes of different sizes to identify the elements of the workflow.
- Direction: This degree of guidance focuses on providing alternatives and options for executing actions in the analysis process, guiding users through possible routes, strategies, or methods, and facilitating their selection based on the user's needs or context.
  - Visual cues and priority signals: The user recognizes the system's recommendation through visual cues (colors), accesses explainability through visualizations, and obtains additional information through text boxes.
  - Alternatives and options: The graph suggests different routes, actions, and algorithms for executing tasks and subtasks, as well as the possibility of interaction by entering or modifying parameters.
- Prescription: The system automatically displays a list of algorithms for each task/subtask and presents a structured workflow sequence based on established rules.
3.2.5. Guided Graph
- The graph is a diagram with three parent nodes representing tasks: Data Quality, Data Reduction, and Behavior Variables, connected by edges that indicate the dependencies between tasks.
- The traversal of the graph begins at the Start node and continues through the first level of nodes of the three tasks (parents), which are displayed larger than the other sublevels.
- Subtasks represent the next level, with nodes smaller than those for tasks and sometimes with another subtask level.
- At the last level are the nodes that represent the algorithms, which are smaller in size.
- The edges are colored based on the activities, as indicated in the legend, signifying a dynamic component based on the progress of the analysis process.
- The system connects the algorithm nodes to the nodes of the next task via edges, forming the user's route. A simplified sketch of this structure follows.
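In simplified form, the guided graph can be described as the following node/edge structure (field names are illustrative, not the tool's actual schema):

```python
# Illustrative structure of the guided graph (hypothetical field names).
# Levels: 0 = start, 1 = tasks (large nodes), 2 = subtasks, 3 = algorithms.
# Edge "status" maps to the legend colors used in Section 4.3
# (pending = red, applied = blue, optional = orange, etc.).
guided_graph = {
    "nodes": [
        {"id": "Start",             "level": 0, "size": "large"},
        {"id": "T01 Data Quality",  "level": 1, "size": "large"},
        {"id": "S01 Null Values",   "level": 2, "size": "medium"},
        {"id": "A05 Random Forest", "level": 3, "size": "small"},
    ],
    "edges": [
        {"from": "Start",            "to": "T01 Data Quality",  "status": "applied"},
        {"from": "T01 Data Quality", "to": "S01 Null Values",   "status": "pending"},
        {"from": "S01 Null Values",  "to": "A05 Random Forest", "status": "pending"},
    ],
}
```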
3.3. Component Tool for Guided Visual Analysis
4. Experiments and Results
4.1. Proposal Configuration
4.2. Test Dataset
- Spain Time Series: The time series dataset Air Quality in Madrid (2001–2018) was used, extracted from the Kaggle repository [46].
- Brazil Time Series: The Qualidade do Ar time series dataset was used, extracted from the São Paulo State Environmental Company, Brazil [47].
- India Time Series: The Air Quality Data in India (2015–2020) time series dataset was used, extracted from the Kaggle repository [48].
- Peru Time Series: The Arequipa time series dataset was used, extracted from the website of the National Meteorological and Hydrological Service of Peru—SENAMHI [49].
- Bitcoin Time Series: The Bitcoin Historical Data dataset was used, obtained from https://www.investing.com/crypto/bitcoin/historical-data (accessed on 10 July 2025).
4.3. Case Study
- The process began with the visualization of a list of the raw time series (Madrid, India, Brazil, Arequipa, and Bitcoin). Upon selection, the stations were displayed, each with information on variables, null values, and the number of records. This gave the user an initial overview of the series (degree of orientation guidance) (Figure A4 in Appendix D).
- The visualization in Figure 7 is the primary interaction screen that shows the integrated workflow for preprocessing and analyzing the behavior of time series.
- The display of general information is complemented by more detailed information accessible via the Statistics and Time Series buttons. This availability provides information at the time it is needed (timeliness, explainability, transparency) regarding changes after preprocessing. It facilitates data interpretation throughout the analysis. The Statistics option displays information on the correlation matrix, bivariate analysis, and descriptive statistics (Figure A5 and Figure A7 in Appendix D). Time Series displays visual information about the series (Figure A8 in Appendix D).
- The graph guides the tasks to be performed through visual color cues, as indicated in the legend:
  - Pending activities: Nodes and edges are displayed in red when the system recommends a route, task, subtask, or algorithm that has not yet been applied. The guide is generated from the analysis in the recommendation module, which evaluates the workflow.
  - Applied activities: Nodes and edges are displayed in blue to indicate actions that have been executed, providing a clear mental map of the process and facilitating workflow tracking.
  - Not-yet-available activities: Nodes and edges are displayed in gray when actions cannot yet be executed because other pending tasks must be completed first.
  - No action required: Nodes and edges are displayed in green when no further action is required.
  - Optional activity: Nodes and edges are displayed in orange when execution is optional and at the user's discretion.
- The system recommends treating null values first (red color); among the seven algorithms, the rolling mean and random forest nodes were recommended (displayed in darker shades). However, users can choose and execute a different algorithm from the one recommended (Figure A1 in Appendix A). This visual recommendation reduces cognitive load by simplifying decision-making, letting the user focus on interpreting results instead of comparing algorithms. The recommendation is therefore relevant and delivered at the right time (opportunity), suggesting the best option (direction degree of guidance).
- The explainability in Figure 4 shows six red boxes in the row of the random forest algorithm and one in the rolling mean row. Of the two recommended algorithms, random forest performed best. The information text also indicates the recommended algorithm. Explainability through visualization and text clarifies the reasons for the recommendation (transparency and trust).
- After imputing null values, the system recommends running the Interquartile Range algorithm for outlier treatment (Figure A2 in Appendix B). The explanation of this recommendation shows that IQR treats 100% of the outliers (Figure 5, on the explainability of outlier algorithms).
- For the next task, the system optionally recommends (in orange) the dimensionality reduction task with the PCA algorithm (Figure A3 in Appendix C). As part of the explainability, a correlation matrix table is displayed to analyze the correlation between variables and decide which variable to reduce. Here, KMO = 0.01; therefore, the PCA algorithm was recommended (Figure 6).
- The necessary preprocessing tasks were completed. The complete path of the actions performed is displayed in the graph (blue color). The series behavior nodes or buttons were enabled for their respective analysis (Figure A15 in Appendix E). Access through the button allowed us to select a specific range of the series (Figure A11 and Figure A12 in Appendix D). With access through the nodes, the behavior of the series was displayed for each variable (Figure A16 and Figure A17 in Appendix E).
- The system displays stacked windows on the left and right sides of the graph; these windows record the executed actions and can be used to reverse an action when desired.
- The tool features a set of buttons in the upper right corner of the main screen, enabling various actions and relevant information: HOME (select the time series to analyze), ASSETS (export the preprocessed series and import a new multivariate time series for analysis), NAVIGATION (primary screen), and TIME SERIES (displays the series, trend, seasonality, spiral, and statistics). (See Appendix D and Figures A9–A12 and A14.)
- Once the analysis is complete, the user can export the data for future use, primarily for predictive models. This option is found in the ASSETS button.
4.4. Evaluation of the Explainability Model
- Clarity (Q1: Simplicity, Q2: Accuracy, Q3: Transparency)
- Timeliness (Q4: Selectivity, Q5: Efficiency)
- Interpretability (Q6: Completeness, Q7: Efficacy, Q8: Anthropomorphic)
- Feature Explanation (Q9: Relevance, Q10: Adaptability)
- Cognitive Relief (Q11: Efficiency, Q12: Persuasion)
- Confidence According to User Expectations (Q13: Competence, Q14: Consistency, Q15: Usability)
- Stages of the Trust Relationship (Q16: Initial, Q17: Intermediate, Q18: Final, Q19: Interactivity)
- Informed Decision-Making (Q20)
- Satisfaction (Q21)
4.5. Evaluation of the Guidance System
- Flexibility (H1)
- Adaptability (H2: Guide change, H3: Experience adjustment)
- Visibility (H4: Easy identification, H5: Visible state and parameters)
- Controllability (H6: Explicit control, H7: Alternative suggestions, H8: Feedback, H9: Parameter adjustment)
- Explainability (H10: Understanding the results guided by the system, H11: Suggestions from the guides are easy to understand, H12: Understanding why the suggestions are provided, H13: Able to request explanations, H14: Able to trust the guides)
- Expressiveness (H15: Clear Language, H16: Unequivocal Coding)
- Timeliness (H17: Just-in-time guidance, H18: Does not disrupt workflow)
- Relevance (H19: The guide helps to overcome analysis dead-ends, H20: The guide helps to complete tasks by reducing errors, H21: The guide saves time, H22: The guide facilitates reasoning, H23: The guide helps answer questions about the data, H24: The guide is appropriate for the task being performed, H25: The guide helps make discoveries, H26: The guide helps generate new hypotheses about the data, H27: Applying the guide helps one to feel confident about the results.)
5. Discussion and Future Work
5.1. Discussion on the Result of the Guidance System
5.2. Discussion of the Results of the Explainability of the Model
5.3. Discussion of the Results of the Case Study
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Recommended Algorithms for Null Values
Appendix B. Recommended Algorithms for Outliers
Appendix C. Recommendation of the Algorithm for Dimensionality Reduction
Appendix D. Case Study
Appendix E. Analysis of Time Series Behavior
References
- Sperrle, F.; Ceneda, D.; El-Assady, M. Lotse: A practical framework for guidance in visual analytics. IEEE Trans. Vis. Comput. Graph. 2022, 29, 1124–1134.
- Ceneda, D.; Andrienko, N.; Andrienko, G.; Gschwandtner, T.; Miksch, S.; Piccolotto, N.; Schreck, T.; Streit, M.; Suschnigg, J.; Tominski, C. Guide me in analysis: A framework for guidance designers. Comput. Graph. Forum 2020, 39, 269–288.
- Streit, M.; Schulz, H.J.; Lex, A.; Schmalstieg, D.; Schumann, H. Model-driven design for the visual analysis of heterogeneous data. IEEE Trans. Vis. Comput. Graph. 2011, 18, 998–1010.
- Milani, A.M.P.; Loges, L.A.; Paulovich, F.V.; Manssour, I.H. PrAVA: Preprocessing profiling approach for visual analytics. Inf. Vis. 2021, 20, 101–122.
- Fan, C.; Chen, M.; Wang, X.; Wang, J.; Huang, B. A review on data preprocessing techniques toward efficient and reliable knowledge discovery from building operational data. Front. Energy Res. 2021, 9, 652801.
- Ceneda, D.; Gschwandtner, T.; May, T.; Miksch, S.; Schulz, H.J.; Streit, M.; Tominski, C. Characterizing guidance in visual analytics. IEEE Trans. Vis. Comput. Graph. 2016, 23, 111–120.
- Collins, C.; Andrienko, N.; Schreck, T.; Yang, J.; Choo, J.; Engelke, U.; Jena, A.; Dwyer, T. Guidance in the human–machine analytics process. Vis. Inform. 2018, 2, 166–180.
- Musleh, M.; Raidou, R.G.; Ceneda, D. TrustME: A Context-Aware Explainability Model to Promote User Trust in Guidance. IEEE Trans. Vis. Comput. Graph. 2025, 31, 8040–8056.
- Zhang, Y.; Perer, A.; Epperson, W. Guided Statistical Workflows with Interactive Explanations and Assumption Checking. In Proceedings of the 2024 IEEE Visualization and Visual Analytics (VIS), St. Pete Beach, FL, USA, 13–18 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 26–30.
- Ha, S.; Monadjemi, S.; Ottley, A. Guided By AI: Navigating Trust, Bias, and Data Exploration in AI-Guided Visual Analytics. Comput. Graph. Forum 2024, 43, e15108.
- Islam, M.R.; Akter, S.; Islam, L.; Razzak, I.; Wang, X.; Xu, G. Strategies for evaluating visual analytics systems: A systematic review and new perspectives. Inf. Vis. 2024, 23, 84–101.
- Theissler, A.; Spinnato, F.; Schlegel, U.; Guidotti, R. Explainable AI for time series classification: A review, taxonomy and research directions. IEEE Access 2022, 10, 100700–100724.
- Sperrle, F.; El-Assady, M.; Guo, G.; Borgo, R.; Chau, D.H.; Endert, A.; Keim, D. A survey of human-centered evaluations in human-centered machine learning. Comput. Graph. Forum 2021, 40, 543–568.
- Ceneda, D.; Gschwandtner, T.; Miksch, S.; Tominski, C. Guided visual exploration of cyclical patterns in time-series. In Proceedings of the IEEE Symposium on Visualization in Data Science (VDS), Berlin, Germany, 21–26 October 2018; IEEE Computer Society: Piscataway, NJ, USA, 2018.
- de Luz Palomino Valdivia, F.; Baca, H.A.H.; Solis, I.S.; Cruz, M.A.; Valdivia, A.M.C. Guided interactive visualization for detecting cyclical patterns in time series. In Proceedings of the Future of Information and Communication Conference, San Francisco, CA, USA, 2–3 March 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 391–402.
- Luboschik, M.; Maus, C.; Schulz, H.J.; Schumann, H.; Uhrmacher, A. Heterogeneity-based guidance for exploring multiscale data in systems biology. In Proceedings of the 2012 IEEE Symposium on Biological Data Visualization (BioVis), Seattle, WA, USA, 14–15 October 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 33–40.
- May, T.; Steiger, M.; Davey, J.; Kohlhammer, J. Using signposts for navigation in large graphs. Comput. Graph. Forum 2012, 31, 985–994.
- Gladisch, S.; Schumann, H.; Tominski, C. Navigation recommendations for exploring hierarchical graphs. In Proceedings of the Advances in Visual Computing: 9th International Symposium, ISVC 2013, Rethymnon, Crete, Greece, 29–31 July 2013; Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2013; pp. 36–47.
- Han, W.; Schulz, H.J. Providing visual analytics guidance through decision support. Inf. Vis. 2023, 22, 140–165.
- Pérez-Messina, I.; Ceneda, D.; El-Assady, M.; Miksch, S.; Sperrle, F. A typology of guidance tasks in mixed-initiative visual analytics environments. Comput. Graph. Forum 2022, 41, 465–476.
- de Luz Palomino Valdivia, F.; Baca, H.A.H.; Valdivia, A.M.C. Guided visual analysis of multivariate time series. In Proceedings of the Future of Information and Communication Conference, San Francisco, CA, USA, 3–4 March 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 247–262.
- García, S.; Luengo, J.; Herrera, F. Data Preprocessing in Data Mining; Springer: Berlin/Heidelberg, Germany, 2015; Volume 72.
- Çetin, V.; Yıldız, O. A comprehensive review on data preprocessing techniques in data analysis. Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi 2022, 28, 299–312.
- Maharana, K.; Mondal, S.; Nemade, B. A review: Data pre-processing and data augmentation techniques. Glob. Transitions Proc. 2022, 3, 91–99.
- Mallikharjuna Rao, K.; Saikrishna, G.; Supriya, K. Data preprocessing techniques: Emergence and selection towards machine learning models—a practical review using HPA dataset. Multimed. Tools Appl. 2023, 82, 37177–37196.
- Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A survey on missing data in machine learning. J. Big Data 2021, 8, 140.
- Hasan, M.K.; Alam, M.A.; Roy, S.; Dutta, A.; Jawad, M.T.; Das, S. Missing value imputation affects the performance of machine learning: A review and analysis of the literature (2010–2021). Inform. Med. Unlocked 2021, 27, 100799.
- Joel, L.O.; Doorsamy, W.; Paul, B.S. A review of missing data handling techniques for machine learning. Int. J. Innov. Technol. Interdiscip. Sci. 2022, 5, 971–1005.
- Thomas, T.; Rajabi, E. A systematic review of machine learning-based missing value imputation techniques. Data Technol. Appl. 2021, 55, 558–585.
- Ceneda, D.; Collins, C.; El-Assady, M.; Miksch, S.; Tominski, C.; Arleo, A. A heuristic approach for dual expert/end-user evaluation of guidance in visual analytics. IEEE Trans. Vis. Comput. Graph. 2023.
- Burstein, F.; Holsapple, C.W. Handbook on Decision Support Systems 2: Variations; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008.
- Yaro, A.S.; Maly, F.; Prazak, P.; Malý, K. Outlier detection performance of a modified Z-score method in time-series RSS observation with hybrid scale estimators. IEEE Access 2024, 12, 12785–12796.
- Sullivan, J.H.; Warkentin, M.; Wallace, L. So many ways for assessing outliers: What really works and does it matter? J. Bus. Res. 2021, 132, 530–543.
- Dastjerdy, B.; Saeidi, A.; Heidarzadeh, S. Review of applicable outlier detection methods to treat geomechanical data. Geotechnics 2023, 3, 375–396.
- Diba, K.; Batoulis, K.; Weidlich, M.; Weske, M. Extraction, correlation, and abstraction of event data for process mining. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2020, 10, e1346.
- Cao, X.H.; Stojkovic, I.; Obradovic, Z. A robust data scaling algorithm to improve classification accuracies in biomedical data. BMC Bioinform. 2016, 17, 359.
- Salles, R.; Belloze, K.; Porto, F.; Gonzalez, P.H.; Ogasawara, E. Nonstationary time series transformation methods: An experimental review. Knowl.-Based Syst. 2019, 164, 274–291.
- Van Der Maaten, L.; Postma, E.O.; Van Den Herik, H.J. Dimensionality reduction: A comparative review. J. Mach. Learn. Res. 2009, 10, 13.
- Sorzano, C.O.S.; Vargas, J.; Montano, A.P. A survey of dimensionality reduction techniques. arXiv 2014, arXiv:1403.2877.
- Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: Berlin/Heidelberg, Germany, 2013; Volume 26.
- James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2013; Volume 112.
- Sedgewick, R.; Wayne, K. Algorithms; Addison-Wesley Professional: Boston, MA, USA, 2011.
- Das, S.; Roca-Feltrer, A.; Hainsworth, M. Application of a Weighted Absolute Percentage Error-Based Method for Calculating the Aggregate Accuracy of Reported Malaria Surveillance Data. Am. J. Trop. Med. Hyg. 2025, 113, 37.
- Roy, S.S.; Samui, P.; Nagtode, I.; Jain, H.; Shivaramakrishnan, V.; Mohammadi-Ivatloo, B. Forecasting heating and cooling loads of buildings: A comparative performance analysis. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 1253–1264.
- Airaksinen, V. The Role of a Brand in Customer Experience. KAMK University of Applied Sciences, 2022. Available online: https://www.theseus.fi/bitstream/handle/10024/754694/Airaksinen_Vilma.pdf?sequence=2 (accessed on 24 January 2025).
- Decide Soluciones. Air Quality Madrid Dataset. 2023. Available online: https://www.kaggle.com/datasets/decide-soluciones/air-quality-madrid (accessed on 10 July 2025).
- CETESB. QUALAR—Qualidade do Ar. 2023. Available online: https://sistemasinter.cetesb.sp.gov.br/ar/php/mapa_qualidade_rmsp.php (accessed on 10 July 2025).
- Vopani. Air Quality Data in India (2015–2020). 2023. Available online: https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india (accessed on 10 July 2025).
- Senamhi. Servicio Nacional de Meteorología e Hidrología del Perú. 2023. Available online: https://www.senamhi.gob.pe/site/descarga-datos/ (accessed on 10 July 2025).
Reference | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | Characteristics |
---|---|---|---|---|---|---|---|---|---|---|
Ceneda et al. [6] | X | X | X | X | X | X | (1) Conceptual model of guidance in VA (characterizing). (3) Importance of perception and cognition in obtaining knowledge. (7), (8), (9) Author of the three degrees of guidance. | |||
Ceneda et al. [14] | X | X | X | X | (2) Design and implementation of an interactive system using a spiral visualization. (3) Difficulty for users to configure cycle length parameters. (5) Implementation and design of the interface. (8) Visual cues on the spiral suggest cycle length configurations based on statistical results. | |||||
Ceneda et al. [2] | X | X | X | X | (1) Approach to a conceptual framework for guide designers in VA. (3) Lack of understanding of the data. (6) The guide reduces mental load by guiding users through the initial stages of analysis. (8) The framework suggests directions and steps for designers, using cues such as requirements and examples to guide the process without imposing complete solutions. | |||||
Palomino et al. [21] | X | X | X | X | X | X | X | X | (1) Proposes guidance and describes interfaces and operators for time series. (2) Implements a task flow tool. (3) Helps with preprocessing tasks. (4) Suggests alternatives of algorithms and routes to follow. (5) Shows algorithms that can be run on preprocessing tasks. (6) Gaining knowledge. (7) Alternatives to algorithms and tasks. (8) Visual signs of the routes to follow. | |
Luboschik et al. [16] | X | X | X | X | (1) Conceptual framework. It seeks to help users understand patterns in multiscale data through visual guidance. (3) Identifies interesting regions in multiscale data. (6) Focuses on perceptual guidance and visual design. (7) Provides heterogeneity metrics to guide the user toward regions of interest. | |||||
May et al. [17] | X | X | X | X | (1) Proposes navigation concepts and methods, with a focus on conceptual frameworks. (3) Seeks to facilitate the exploration and understanding of large graphs through visual techniques. (6) Focuses on perceptual and visual design aspects. (7) Provides alternatives to guide navigation. | |||||
Streit et al. [3] | X | X | X | X | (1) Proposes a model-driven design process and describes interfaces, operators, and data. (3) Facilitates the understanding and visual analysis of heterogeneous data through a structured design. (6) Focuses on aspects of visual design and perceptual analysis. (7) Presents alternatives through defined interfaces and routes. | |||||
Gladisch et al. [18] | X | X | X | X | (1) Based on the degree of interest (DOI), it describes visualizations and algorithms, focusing on concepts and a test implementation without providing extensive practical details. (3) Seeks to facilitate the visual exploration of hierarchical graphs through recommendations. (6) Focuses on visual perception and user assistance with visual cues. (7) Provides recommendations through visual cues and DOIs, guiding the user to nodes of interest. | |||||
Palomino et al. [15] | X | X | X | X | (2) Implements a guided tool to find cyclical patterns. (3) Setting parameters, especially the cycle length. (5) Spiral implementation to detect cyclic patterns and star-like glyphs. (8) Visual signals on the cycle length parameter. | |||||
Collins et al. [7] | X | X | X | X | X | X | (1) Proposes a conceptual framework, discussing abstract concepts such as types of guidance, objectives, and requirements. (3) The guide helps analysts understand unknown data, identify patterns, and build mental models. (6) Pattern extraction, bias mitigation, and cognitive load reduction. (7), (8), (9) The framework recommends using all three degrees of guidance. | |||
Han et al. [19] | X | X | X | X | X | X | (1) Presents a conceptual framework for designing guides in visual analytics (VA) based on decision points and decision support (MCDA), with emphasis on the design and implementation stages, but includes an initial prototype as a practical example. (4) Defines guidance as support at decision points using MCDA to evaluate alternatives. (5) Presents a cognitive bias because it focuses on reducing cognitive load, mitigating biases, and adapting the guide to the user’s specific needs. (6) Includes technical aspects such as integration with GIS and MCDA. (7) Alternatives are evaluated using multicriteria decision analysis (MCDA). (8) Provides an overview by identifying decision points. | |||
Perez et al. [20] | X | X | X | X | X | X | X | X | (1) Proposes a model, a typology, and a conceptual analysis of guidance tasks in mixed visual analysis environments. (3) Focuses on supporting data exploration and understanding. (4) Considers decision-making in analytical contexts. (5) Adapts user–system interactions. (6) Task decomposition and guidance systems. (7), (8), (9) The model considers the three degrees of guidance. |
Reference | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | Total |
---|---|---|---|---|---|---|---|---|---|---|
Garcia et al. [22] | X | X | X | X | X | 5 | ||||
Fan et al. [5] | X | X | X | X | X | X | 6 | |||
Maharana et al. [24] | X | X | X | X | X | 5 | ||||
Çetin et al. [23] | X | X | X | X | X | X | 6 |
Mallikharjuna et al. [25] | X | X | X | X | 4 | |||||
Palomino et al. [21] | X | X | X | X | X | 5 | ||||
Total | 5 | 1 | 6 | 6 | 3 | 3 | 5 | 1 | 1 | 31 |
Imputation Technique | Emmanuel et al. [26], 2021 | Joel et al. [28], 2022 | Thomas et al. [29], 2021 | Hasan et al. [27], 2021 | Total |
---|---|---|---|---|---|
K-nearest neighbor (KNN) | X | X | X | X | 4 |
Random forest regressor | X | X | X | 3 | |
Mean | X | X | X | 3 | |
Median | X | X | X | 3 | |
Linear regression | X | X | X | 3 | |
Additive regression | X | X | 2 | ||
Polynomial regression | X | 1 | |||
Decision trees | X | X | 2 | ||
Support vector machine (SVM) | X | X | 2 | ||
Clustering | X | X | 2 | ||
Ensemble method | X | 1 | |||
Mode | X | 1 | |||
LOCF (Last Observation Carried Forward) | X | 1 | |||
NOCB (Next Observation Carried Backward) | X | 1 | |||
Hot-deck | X | 1 | |||
Bayesian PCA (Bayesian principal component analysis) | X | 1 |
Group | Task | Code |
---|---|---|
Preprocessing | Data Quality | T01 |
Data Reduction | T02 | |
Time Series Behavior | Time Series Behavior | T03 |
Subtask | Code |
---|---|
Null Values | S01 |
Outliers | S02 |
Normalization | S03 |
Transformation | S04 |
Data Reduction | S05 |
Trend | S06 |
Seasonality | S07 |
Cycle | S08 |
Noise | S09 |
Algorithms | Code |
---|---|
Rolling Mean and Moving Median | A01 |
Decision Tree | A02 |
Stochastic Gradient Boosting | A03 |
Robust Locally Weighted Regression | A04 |
Random Forest Regressor | A05 |
Legendre Polynomials | A06 |
K Nearest Neighbor with Bagging Improvement (KNN) | A07 |
Interquartile Range (IQR) | A08 |
Scoring Method (Z-score) | A09 |
StandardScaler | A10 |
MinMaxScaler | A11 |
Robust-Scale | A12 |
Differencing | A13 |
Logarithm | A14 |
Quadratic Transformation | A15 |
Square Root Transformation | A16 |
Linear Transformation | A17 |
Principal Component Analysis (PCA) and Correlation | A18 |
Factor Analysis | A19 |
Seasonal Decompose | A20 |
Fourier Transform | A21 |
Visual Interfaces | Information |
---|---|
I01 ScatterPlot | Statistics: Bivariate analysis |
I02 Lineplot | Time series, behavior |
I03 Spiral | Univariate cyclicality |
I04 BoxPlot | Statistics: Data distribution |
I05 Histogram | Statistics: Bivariate analysis, data distribution |
I06 Heatmap | Explainability of null values, outliers |
I07 Star Glyph | Multivariate relationship |
I08 Dynamic graph with hierarchical structure | Workflow |
I09 Bar Graph | Normalization, transformation |
I10 Correlation matrix | Statistics, dimensionality reduction |
Group | Task | Subtask | Algorithms |
---|---|---|---|
G1 | T01 Data Quality | S01 Null values | A01 Rolling Mean and Moving Median |
A02 Decision Tree | |||
A03 Stochastic Gradient Boosting | |||
A04 Robust Locally Weighted Regression | |||
A05 Random Forest Regressor | |||
A06 Legendre Polynomials | |||
A07 Bagged k-Nearest Neighbor (KNN) | |||
S02 Outliers | A08 Interquartile Range (IQR) | ||
A09 Scoring Method (Z-score) | |||
S03 Normalization | A10 StandardScaler | ||
A11 MinMaxScaler | |||
A12 MaxAbsScaler | |||
A13 Robust-scale | |||
S04 Transformation | A15 Differencing | ||
A16 Logarithm | |||
A17 Quadratic Transformation | |||
A18 Square Root Transformation | |||
A19 Linear Transformation | |||
T02 Data Reduction | S05 Dim.Reduction | A20 Principal Component Analysis (PCA) | |
A21 Factor Analysis | |||
G2 | T03 Data Behavior | Trend | |
Cyclicality | A22 Fourier Transform | ||
Seasonality | |||
Noise | Seasonal Decompose |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 2 | 5 | 6 | NaN | 2 | 1 | NaN | 4 | 6 | NaN | NaN | 9 | NaN | 4 | NaN |
Complete Data | Null Data | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 5 | 6 | 8 | 9 | 12 | 14 | 15 | 4 | 7 | 10 | 11 | 13 | |
3 | 2 | 5 | 6 | 2 | 1 | 4 | 6 | 9 | 4 | NaN | NaN | NaN | NaN | NaN | NaN |