Hamming Diversification Index: A New Clustering-Based Metric to Understand and Visualize Time Evolution of Patterns in Multi-Dimensional Datasets
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The proposed methodology is interesting and useful for analyzing and predicting trends in Big Data sets. But there are some comments:
1) For further methodology improvement, there is a need to present a statement of the research problem and the algorithm for its solution in a formalized form (given - find + algorithm). Taking into account this gap, it is not very clear from the text of the paper what the modification of the Hamming distance is.
2) Some metrics, including volatility, variability, and distance, are not represented in a formalized form at the theoretical level; this is negative because their meaning is not fully understood.
3) The tables and figures are not traditionally signed; it is better to move the description of the results presented in them to the main text.
Author Response
Reviewer 1
Comment 1: For further methodology improvement, there is a need to present a statement of the research problem and the algorithm for its solution in a formalized form (given - find + algorithm). Taking into account this gap, it is not very clear from the text of the paper what the modification of the Hamming distance is.
Response 1:
We thank the reviewer for this valuable feedback regarding the formalization of our methodology. In response, we have made several key enhancements to address the identified gaps. First, we added a comprehensive formalized problem statement in Section 1.1 (after line 114) that clearly articulates the given inputs (entity-time-feature data matrix, statistical boundaries), the objectives to find (categorical trajectories, diversification indices, distance metrics), and the research goal using standard mathematical notation. Second, we inserted Algorithm 1 in Section 4.2 (after line 308) that provides a step-by-step algorithmic framework with computational complexity analysis, making the methodology reproducible and implementable. Third, we significantly clarified the modification to the Hamming distance by explicitly contrasting the standard binary formulation (δ(xi,yi) = 1 if xi ≠ yi, 0 otherwise) with our ordinal-aware modification (|xi - yi|) that preserves categorical proximity relationships. The modified distance recognizes that transitioning between adjacent categories (e.g., 1→2) represents smaller changes than distant categorical shifts (e.g., 1→5), thus providing more nuanced similarity assessment for ordinal categorical trajectories. These additions establish the rigorous mathematical foundation requested while maintaining the interpretative accessibility that makes our framework suitable for policy applications.
Comment 2: Some metrics, including volatility, variability, and distance, are not represented in a formalized form at the theoretical level; this is negative because their meaning is not fully understood.
Response 2: We appreciate the reviewer's observation regarding the informal presentation of key metrics. In response, we have comprehensively formalized all trajectory metrics at the theoretical level with precise mathematical definitions, theoretical properties, and computational algorithms to ensure reproducibility and eliminate interpretative ambiguity.
Core Distance Metrics Formalization: We have rigorously formalized the modified Hamming distance as the central analytical tool. While the standard Hamming distance employs binary categorical differences (δ(x_i, y_i) = 1 if x_i ≠ y_i, 0 otherwise), our ordinal-aware modification preserves categorical proximity relationships through the formulation d(T_i, T_j) = Σ_{t=1}^{m} |c_{i,t} − c_{j,t}|, where c_{i,t} represents the categorical assignment for entity i at time t. This modification recognizes that adjacent categorical transitions (e.g., 1→2) represent fundamentally different magnitudes of change than distant categorical shifts (e.g., 1→5), providing more nuanced similarity assessment for trajectories derived from continuous data through statistical transformation.
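To make the contrast concrete, here is a minimal Python sketch of the two unnormalized distances as stated above. It is an illustration under assumptions, not the authors' implementation: the function names `hamming` and `modified_hamming` and the three example trajectories are hypothetical.

```python
from typing import Sequence


def hamming(x: Sequence[int], y: Sequence[int]) -> int:
    """Standard Hamming distance: count positions where the categories differ."""
    if len(x) != len(y):
        raise ValueError("trajectories must have equal length")
    return sum(1 for a, b in zip(x, y) if a != b)


def modified_hamming(x: Sequence[int], y: Sequence[int]) -> int:
    """Ordinal-aware variant: sum of absolute category differences."""
    if len(x) != len(y):
        raise ValueError("trajectories must have equal length")
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical trajectories: `near` and `far` each differ from `base` at one
# time point, by one category (1 vs 2) and by four categories (1 vs 5).
base = [1, 1, 2, 3]
near = [2, 1, 2, 3]
far = [5, 1, 2, 3]

# The standard distance cannot tell the two cases apart; the modified one can.
print(hamming(base, near), hamming(base, far))                    # 1 1
print(modified_hamming(base, near), modified_hamming(base, far))  # 1 4
```

Both trajectories differ from `base` at a single time point, so the binary distance ranks them equally, while the ordinal-aware distance reflects the size of the categorical gap.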
Comprehensive Trajectory Metrics Framework: We have established formal mathematical definitions for all eight trajectory metrics that comprise our analytical framework:
Volatility (Categorical Instability): Formalized as the standard deviation of categorical transitions: σ_v = √[(1/(n−1)) Σ_{i=2}^{n} (|c_i − c_{i−1}| − μ)²], where μ = (1/(n−1)) Σ_{i=2}^{n} |c_i − c_{i−1}| represents the mean absolute categorical change. This captures the degree of unpredictability in categorical state transitions, providing insights into institutional stability and adaptive capacity.
Variability (Categorical Range): Defined as the span of categorical states experienced: V = max(c_i) − min(c_i) for i ∈ {1, 2, ..., n}. This metric quantifies the breadth of evolutionary trajectories, distinguishing between entities that traverse multiple categorical states versus those operating within constrained ranges.
Distance Metrics: We establish both regular Hamming distance d_H(x, y) = Σ_{i=1}^{n} δ(x_i, y_i) and our modified formulation d_MH(x, y) = Σ_{i=1}^{n} |x_i − y_i|, with formal proof of theoretical properties including non-negativity, symmetry, triangle inequality, and ordinal preservation, ensuring mathematical rigor for categorical trajectory comparison.
Persistence (Categorical Stability): Formalized as P = [Σ_{i=2}^{n} 1[c_i = c_{i−1}]] / (n−1), where 1[·] represents the indicator function. This measures the proportion of time spent maintaining identical categorical states, providing insights into institutional inertia and policy continuity.
Trend Strength (Directional Consistency): Defined as TS = |β|, where β represents the slope coefficient from the linear regression c_i = α + β t_i + ε_i. This captures systematic directional evolution independent of specific categorical values, enabling comparison of trend magnitudes across different evolutionary patterns.
Trajectory Changes (Directional Reversals): Formalized as TC = Σ_{i=3}^{n} 1[sign(c_i − c_{i−1}) ≠ sign(c_{i−1} − c_{i−2})], quantifying the frequency of directional reversals in categorical movement. This metric identifies oscillatory behavior and strategic pivot points in evolutionary trajectories.
Number of Steps (Discrete Transitions): Defined as S = Σ_{i=2}^{n} 1[c_i ≠ c_{i−1}], counting discrete categorical state changes throughout the observation period. This provides a fundamental measure of trajectory dynamism and adaptive frequency.
Extreme Visits (Boundary Exposure): Formalized as EV = Σ_{i=1}^{n} (1[c_i = 0] + 1[c_i = 5]), counting observations in extreme categorical boundaries. This captures exposure to crisis conditions or exceptional performance periods, providing insights into risk management and exceptional state navigation.
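The per-trajectory definitions above can be sketched in Python as follows. This is an illustrative reading of the formulas as written in this response, not the authors' Algorithm 1. The function name `trajectory_metrics`, the helper `sign`, the equally spaced time index t = 0, ..., n−1 used for the regression, and the example trajectory are all assumptions. Trajectory changes are counted literally, so a move followed by a pause (sign 1, then sign 0) also counts as a sign change.

```python
import math
from typing import Dict, Sequence


def sign(v: int) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def trajectory_metrics(c: Sequence[int], low: int = 0, high: int = 5) -> Dict[str, float]:
    """Compute the single-trajectory metrics for a categorical series c."""
    n = len(c)
    if n < 3:
        raise ValueError("need at least three observations")
    diffs = [c[i] - c[i - 1] for i in range(1, n)]  # n-1 transitions
    abs_d = [abs(d) for d in diffs]
    mu = sum(abs_d) / (n - 1)                       # mean absolute change

    # Ordinary least-squares slope of c against t = 0..n-1 (assumed spacing).
    t_mean = (n - 1) / 2
    c_mean = sum(c) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (c[t] - c_mean) for t in range(n))

    return {
        "volatility": math.sqrt(sum((d - mu) ** 2 for d in abs_d) / (n - 1)),
        "variability": max(c) - min(c),
        "persistence": sum(1 for d in diffs if d == 0) / (n - 1),
        "trend_strength": abs(sxy / sxx),
        "trajectory_changes": sum(
            1 for i in range(1, len(diffs)) if sign(diffs[i]) != sign(diffs[i - 1])
        ),
        "steps": sum(1 for d in diffs if d != 0),
        "extreme_visits": sum((v == low) + (v == high) for v in c),
    }


# Hypothetical six-period trajectory on a 0..5 categorical scale.
m = trajectory_metrics([1, 2, 2, 3, 5, 4])
for name, value in m.items():
    print(f"{name}: {value}")
```

For this example trajectory, the transitions are 1, 0, 1, 2, −1, which gives four steps, one persistent period out of five, a range of 4, and one visit to the upper extreme.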
Theoretical Properties and Computational Framework: We have established rigorous theoretical foundations for all metrics, including bounded ranges (volatility, persistence ∈ [0,1]; steps, trajectory changes, extreme visits ∈ [0,n]; range span ∈ [0,5]), computational complexity (O(n²m + nmk) for complete framework), and mathematical properties ensuring consistent interpretation across diverse applications. Each metric is grounded in established literature on temporal pattern mining [Moskovitch & Shahar, 2015; Batal et al., 2013] and categorical distance measures [Ahmad & Hashmi, 2022; Cha, 2007].
Algorithm Integration: We have integrated all formalized metrics into Algorithm 1 (Section 4.2), providing step-by-step computational procedures with complexity analysis. The algorithm ensures reproducible implementation while maintaining interpretative clarity for policy applications.
Empirical Validation Framework: The formalized metrics are demonstrated through comprehensive analysis of both synthetic data (Section 3) and real-world government debt trajectories (Section 5), with complete computational results presented in standardized tabular format. Each metric's discriminatory power is validated through systematic comparison across entities with known evolutionary patterns, establishing empirical foundations for broader applications.
Integration with Distance-Based Similarity: All trajectory metrics are integrated with our distance-based similarity framework, enabling comprehensive characterization of temporal evolution patterns. The modified Hamming distance serves as the foundation for pairwise trajectory comparison, while individual metrics provide detailed insights into specific aspects of evolutionary dynamics.
These comprehensive formalizations address the reviewer's concern by providing rigorous mathematical foundations that eliminate interpretative ambiguity while maintaining accessibility for policy-oriented research. The theoretical framework now supports reproducible implementation across diverse applications while preserving the intuitive interpretability that makes the Hamming Diversification Index suitable for comparative analysis in social science and policy research contexts.
Comment 3: The tables and figures are not traditionally signed; it is better to move the description of the results presented in them to the main text.
Response 3:
We thank the reviewer for this important formatting observation. In response, we have systematically moved all descriptive content from table and figure captions into the main text, following traditional academic presentation standards. Specifically, we relocated detailed descriptions for Table 1 (synthetic dataset), Figure 1 (boxplot analysis), Table 2 (ordinal mapping), Table 3 (categorical results), Figure 2 (trajectory visualization), Tables 4-7 (distance analysis), Figure 3 (country trajectories), and Table 8 (trajectory metrics) into the corresponding text sections, preserving analytical content while maintaining concise, descriptive captions. Each moved description now provides comprehensive interpretation of results within the narrative flow, enhancing readability and ensuring that substantive analytical insights are integrated into the main argument rather than relegated to caption text. The revised format improves manuscript flow while maintaining all essential interpretative content for understanding the methodology's effectiveness and empirical validation.
Reviewer 2 Report
Comments and Suggestions for Authors
Dearest Authors
after careful evaluation of your submitted Manuscript, I found the work interesting, straightforward and very focused on the results. Nonetheless, a proofreading could be useful to detect and correct some minor typos.
A single question arose, which I'm reporting below.
Best Regards
1) The difference between Hamming distance and Weighted Hamming distance is clear in the theoretical framework, but it's not in the real world application. Specifically, which is the entity of wi for each Country? Since the majority of the subsequent analysis is devoted to the comparison between the unweighted and weighted Hamming distance, this aspect should be reported in detail.
Author Response
Comment 1: The difference between Hamming distance and Weighted Hamming distance is clear in the theoretical framework, but it's not in the real world application. Specifically, which is the entity of wi for each Country? Since the majority of the subsequent analysis is devoted to the comparison between the unweighted and weighted Hamming distance, this aspect should be reported in detail.
Response 1: We thank the reviewer for this important clarification request, which highlights a critical notational ambiguity in our empirical presentation. Upon careful review, we acknowledge that our notation created confusion between temporal weighting and ordinal structure modification. This warranted an explicit clarification of our distance metric implementation.
Clarification of Notation
The distinction between H(X, Y) and HW(X, Y), as reported in Tables 4, 6, and 7, does not refer to temporal weighting schemes. Rather, it reflects two fundamentally different approaches to categorical distance calculation. To eliminate ambiguity, we have revised our notation and terminology accordingly:
- H(X, Y): Refers to the regular Hamming distance, which uses a binary categorical comparison. Here, δ(xᵢ, yᵢ) = 1 if xᵢ ≠ yᵢ and 0 if xᵢ = yᵢ. This approach treats all categorical differences as equal, irrespective of ordinal proximity.
- HW(X, Y): Denotes a modified Hamming distance that is ordinal-aware. It incorporates the absolute difference |xᵢ − yᵢ|, capturing the magnitude of categorical shifts. The "W" indicates weighting by ordinal distance—not by time.
No Temporal Weighting Applied
Crucially, our empirical analysis does not apply any country-specific or time-specific weighting scheme (wᵢ). Each year from 1960 to 2022 contributes equally to trajectory comparisons across all country pairs. This equal temporal weighting ensures that the analysis reflects long-term patterns without bias toward any historical period.
Mathematical Formulation
For any country pair (i, j) over time period T = {1960, 1961, ..., 2022}, the distance calculations are defined as follows:
- Regular Hamming distance:
d_H(Tᵢ, Tⱼ) = (1 / |T|) ∑ₜ∈T δ(cᵢ,ₜ, cⱼ,ₜ)
- Modified Hamming distance (ordinal-aware):
d_MH(Tᵢ, Tⱼ) = (1 / 2|T|) ∑ₜ∈T |cᵢ,ₜ − cⱼ,ₜ|
Here, cᵢ,ₜ represents the categorical debt level of country i at time t. No temporal weights wᵢ,ₜ are used.
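As a rough illustration of the normalized forms, the sketch below applies equal weight to every time point. It is a sketch under assumptions rather than the authors' code: the function names are hypothetical, the prefactor of the modified distance is read as 1/(2|T|) exactly as the formula is stated (its rationale is not given in this response), and the four-period category sequences are invented for the example.

```python
from typing import Sequence


def normalized_hamming(ci: Sequence[int], cj: Sequence[int]) -> float:
    """Share of time points at which the two categorical levels differ."""
    if len(ci) != len(cj) or not ci:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(1 for a, b in zip(ci, cj) if a != b) / len(ci)


def normalized_modified_hamming(ci: Sequence[int], cj: Sequence[int]) -> float:
    """Sum of absolute category gaps, scaled by 1/(2|T|) as in the stated formula."""
    if len(ci) != len(cj) or not ci:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(ci, cj)) / (2 * len(ci))


# Hypothetical categorical levels for two countries over four years.
country_i = [1, 2, 3, 3]
country_j = [1, 3, 5, 3]

d_h = normalized_hamming(country_i, country_j)            # 2 / 4
d_mh = normalized_modified_hamming(country_i, country_j)  # (0 + 1 + 2 + 0) / 8
print(d_h, d_mh)  # 0.5 0.375
```

No per-year or per-country weight appears anywhere in the computation; each time point contributes one term to the sum.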
(content added right after the debt trajectories)
Reviewer 3 Report
Comments and Suggestions for Authors
For the transformation from continuous to ordinal categories, the method lacks validation of how well the discretization preserves important information.
Metrics like “steps,” “trajectory changes,” and “volatility” are not benchmarked. What is high or low for these measures? It is not clear whether data smoothing, detrending, or noise filtering was applied before categorization.
In general, the paper is lacking in many aspects and I do not recommend it.
Author Response
We have done a major revision to address the specific issues raised by the reviewers. We are enclosing our detailed responses to reviewers.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The reply is satisfactory.
I will endorse publication.
My best regards
Reviewer 3 Report
Comments and Suggestions for Authors
I have a positive view on this.