Descriptive Analysis and Clustering-Based Productive Scale Segmentation of Colombian Transitory Crop Production: A Departmental-Level Approach

Muñoz, Norbey D.; Barón-Velandia, Julio; Vanegas-Ayala, Sebastian-Camilo

doi:10.3390/agriculture16090980


by Norbey D. Muñoz ¹,²,*, Julio Barón-Velandia ¹ and Sebastian-Camilo Vanegas-Ayala ¹

¹ Systems Engineering Program, Faculty of Engineering, Universidad Distrital Francisco José de Caldas, Bogotá 111611, Colombia

² Systems Engineering Program, Faculty of Engineering and Basic Sciences, Fundación Universitaria Los Libertadores, Bogotá 111221, Colombia

* Author to whom correspondence should be addressed.

Agriculture 2026, 16(9), 980; https://doi.org/10.3390/agriculture16090980

Submission received: 17 March 2026 / Revised: 8 April 2026 / Accepted: 10 April 2026 / Published: 29 April 2026

(This article belongs to the Section Agricultural Economics, Policies and Rural Management)


Abstract

Colombian transitory crop production exhibits marked structural heterogeneity across department–crop combinations, yet empirical characterizations of productive scale at the subnational level remain scarce. This study presents a descriptive analysis and clustering-based productive scale segmentation of Colombian transitory crops at the departmental level for the period 2007–2024. Data from the Evaluaciones Agropecuarias Municipales (EVA) were processed through a structured CRISP-DM pipeline comprising preprocessing of 347,141 records, departmental aggregation, and engineering of five clustering features: average production, average planted area, number of active periods, and temporal and spatial Herfindahl–Hirschman indices. K-Means clustering ($k = 3$) was applied to a final dataset of 490 department–crop pairs and validated based on a global silhouette coefficient of 0.888. The segmentation reveals a markedly asymmetric productive structure: 93.7% small scale (459 pairs), 5.3% medium scale (26 pairs), and 1.0% large scale (5 pairs), with natural breakpoints at approximately 35,386 t and 275,959 t. Large-scale production is concentrated in papa (Cundinamarca, Boyacá, Nariño) and arroz (Casanare, Tolima). Clustering demonstrated quantitative superiority over quartile-based classification, reducing the within-group coefficient of variation from 223.9% to 30.6% for the upper segment. The methodology is replicable across national agricultural statistics systems, and the processed dataset is publicly available under CC BY 4.0.

Keywords: transitory crops; K-Means clustering; productive scale; Colombia; departmental analysis; EVA; food security; agricultural characterization; unsupervised learning

1. Introduction

Agricultural production is a fundamental pillar of economic development and food security worldwide [1]. In Colombia, this sector plays a strategic role in economic growth, rural employment, and territorial development [2]. In 2024, agricultural activities recorded a growth rate of 8.1%, positioning the sector as one of the main drivers of national economic recovery [3]. According to the 2023 National Agricultural Survey, approximately 4.7 million hectares, equivalent to 9.4% of the national territory, are dedicated to agricultural use [4]. Within this landscape, transitory crops occupy a central place: cultivated in two semi-annual cycles per year, they are critical to the short-term food supply, sensitive to climatic and market fluctuations, and constitute the productive base of rural communities in all 32 Colombian departments [5,6].

Agricultural systems in developing countries generally exhibit a highly heterogeneous productive structure, where output volumes, cultivated areas, and yield levels vary substantially between regions, crops, and productive units, reflecting differences in agroecological conditions, technological adoption, and territorial productive capacity [7,8]. Colombia is no exception: transitory crops display this same heterogeneity at the departmental level, posing a fundamental analytical challenge. When applied across such structurally diverse productive systems, uniform approaches misrepresent the sector, obscure relevant patterns, and limit the usefulness of any subsequent analysis. Characterizing this structure rigorously is therefore not a minor preliminary step but a necessary analytical contribution to the development of evidence-based agricultural frameworks [9,10]. This structural heterogeneity also has direct economic implications—for resource allocation, market access, and the design of differentiated policy instruments—that can only be properly addressed once the productive structure of the sector has been empirically characterized at the appropriate level of disaggregation. Recent contributions in the modeling of complex economic systems have emphasized the need to move beyond static and aggregate representations toward structurally consistent frameworks that capture heterogeneity, interdependence, and dynamic behavior in production systems [11]. The present study operates at a different but complementary analytical scale: rather than modeling system-level dynamics, it constructs the empirical disaggregated characterization that such frameworks require as an input.

Unsupervised machine learning and clustering methods have become an established tool in agricultural analytics, with applications spanning three broad methodological objectives. The first encompasses spatial segmentation of productive regions and management zones, where clustering has been used to delineate homogeneous areas for precision agriculture interventions and land suitability assessment [12,13]; these studies typically operate at the field or sub-regional level and prioritize spatial contiguity over temporal depth. The second involves classification of crops and land cover from remote sensing or UAV imagery [14], where the analytical unit is the pixel or image object and the temporal dimension is often limited to a single season. The third groups observational units—farms, stations, or administrative units—by behavioral, climatic, or productive similarity over time [15], an objective more closely aligned with the present study. Across all three applications, a consistent finding emerges: prior segmentation reduces within-group heterogeneity and improves both the performance and interpretability of subsequent predictive models, particularly when the underlying data exhibit marked structural asymmetry [16,17,18]. Despite this shared benefit, the three approaches differ substantially in scope and analytical resolution: remote sensing applications prioritize spatial pattern recognition at fine scales; production-based segmentation studies typically address a single crop or a limited geographic area without a prior characterization phase that establishes the full productive structure of the sector; and regional studies often rely on aggregate indicators that smooth out the heterogeneity they seek to characterize. 
In the Colombian context, this methodological gap is particularly evident: existing work focuses on individual crops or regions [19,20,21,22], applies predictive models without a structured productive characterization phase [23], or relies on national aggregates that obscure departmental-level variation. As a result, the full set of department–crop combinations for transitory crops remains uncharacterized as a system. None of these contributions addresses how productive output is distributed across the complete set of department–crop combinations at the national level, or whether that distribution can be segmented into empirically grounded categories with practical interpretive value. To the best of the authors’ knowledge, no prior study has produced a systematic, data-driven characterization of productive scale across the full spectrum of transitory crops and departments in Colombia; the present study addresses this gap using nearly two decades of open EVA data.

The Evaluaciones Agropecuarias Municipales (EVA) published by the Colombian Ministry of Agriculture and Rural Development (MADR) represent a valuable and underutilized resource for this purpose [24,25]. The database is drawn from two complementary sources covering the periods 2007–2018 and 2019–2024, which together span nearly two decades with semi-annual granularity, national coverage across all 32 departments, and variables including planted area, harvested area, production, and yield. Together, these sources constitute the most comprehensive empirical dataset available in Colombia for longitudinal analysis of departmental agricultural production. Despite its historical depth and breadth, this dataset has not been systematically exploited to build a rigorous characterization of the transitory crop sector through machine learning techniques. When systematically consolidated, cleaned, and transformed, this dataset enables the development of reproducible and scalable analytical frameworks aligned with open data principles and with the Sustainable Development Goals, particularly SDG 2 (Zero Hunger) and SDG 12 (Responsible Consumption and Production) [26,27,28].

This article pursues three objectives applied to Colombian transitory crop production at the departmental level for the period 2007–2024, guided by two research questions: (1) How is productive output distributed across department–crop combinations in the Colombian transitory crop system over the period 2007–2024? (2) Does clustering-based segmentation produce more internally homogeneous and interpretable productive scale categories than quartile-based classification? The first objective is to characterize the productive structure of the sector through a comprehensive descriptive analysis of production volume, planted area, yield, spatial coverage, and temporal continuity across the full set of department–crop combinations registered in the EVA, thereby establishing the empirical foundation for the segmentation phase. Starting from raw EVA data, we document a preprocessing and quality control protocol that consolidates and refines the information to produce 490 department–crop pairs that meet a minimum temporal continuity criterion. The second objective is to produce an empirically grounded productive scale segmentation of these 490 pairs using K-Means clustering, resulting in three interpretable categories (small, medium, and large scale) validated against a quartile-based classification as a baseline. The third objective is to make the resulting segmentation openly available as a documented dataset on Zenodo, supporting subsequent research and evidence-based agricultural policy design in Colombia. Together, these outputs provide a concrete empirical contribution to national agricultural analytics, grounded in rigorous data integration, transparent preprocessing, and national-scale application of established analytical methods [29,30]. The segmentation of productive scale developed here further serves as the analytical foundation for subsequent work on temporal pattern analysis and forecasting of the highest-output department–crop pairs.

The remainder of the article is organized as follows. Section 2 describes the data sources, the preprocessing protocol, and the clustering methodology. Section 3 presents the descriptive findings and the segmentation of the productive scale, including the validation against the quartile-based approach. Section 4 discusses the implications of the results and situates them within the broader literature. Section 5 concludes with a summary of contributions and directions for future research.

2. Materials and Methods

This study follows the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology [31], a standard process framework widely adopted in data science and data mining projects for its iterative structure, problem-oriented design, and independence from specific tools or domains. This methodology has been successfully applied in clustering-based segmentation studies in various domains [32,33,34], including cases involving open and public geographical and productive data [35]. The CRISP-DM phases map onto the stages of this study as follows: business understanding and data understanding correspond to problem identification, source exploration, and initial descriptive analysis (Section 2.1 and Section 2.2); data preparation encompasses cleaning, imputation, aggregation, and dataset consolidation (Section 2.2); modeling covers feature engineering and the application of the clustering algorithm (Section 2.3 and Section 2.4); and evaluation includes model validation and comparison with the quartile-based classification (Section 2.4.4 and Section 2.5). Figure 1 presents the complete methodological pipeline of the study.

The data preparation process follows a structured five-stage scheme [36], illustrated in Figure 2: data cleaning (treatment of missing values, inconsistent zeros, and anomalous records), data transformation (normalization of types, formats, and nomenclature), feature construction, feature selection, and dimensionality reduction through principal component analysis. This scheme, adapted to the context of the present study, guides the methodological decisions described in the following subsections.

2.1. Data Sources

The dataset used in this study is constructed from two independent sources, both derived from the Evaluaciones Agropecuarias Municipales (EVA), published by the Colombian Ministry of Agriculture and Rural Development (MADR) [24,25]. The EVA is a national survey that annually collects, through a reconciliation mechanism involving administrative records and sectoral experts, information on planted area, harvested area, production, and yield for approximately 200 agricultural products in 1103 municipalities.

The first source covers the period 2007–2018, comprising 206,068 records and 17 variables. The second covers 2019–2024, with 141,073 records and 18 variables.

Table 1 summarizes the main characteristics of each source prior to any processing.

Since the central objective of the study is the analysis of transitory crop production, the analytical universe is restricted to records classified under this crop cycle category. Transitory crops complete their planting-to-harvest cycle in a single year, allowing them to respond more rapidly to climatic, market, and policy changes. Unlike permanent crops, whose productive dynamics are governed by long-term investment horizons and exhibit structural heterogeneity incompatible with the proposed analysis, transitory crops offer a homogeneous temporal structure directly relevant to short- and medium-term agricultural planning. Annual crops, representing 4% of the total, are excluded to maintain a strict definition aligned with the official EVA classification.

2.2. Data Preprocessing

2.2.1. Record-Level Preprocessing

Dataset consolidation.

Both sources undergo an initial review of structure, nomenclature, and data types. The variables relevant to the study are retained (department, crop, year, period, sown_area (planted_area), harvested_area, production, yield, and crop_cycle), and their names are normalized to ensure consistency between sources. Following normalization, the two datasets are concatenated into a single dataset of 347,141 records. All text values are converted to lowercase and special characters are removed. A naming inconsistency in the department column for “San Andrés y Providencia” was resolved by standardizing to a normalized form. For the crop column, normalization reduced unique values from 389 to 249, grouping only typographic differences (capitalization, accents) without affecting real semantic distinctions between crops. After applying the transitory crop cycle filter, the dataset retains 194,850 records in 32 departments and 97 crops.

Crop names are maintained in Spanish throughout this article, as they appear in the crop field of the original EVA data source and in the generated dataset. English equivalents for all crop names mentioned in the text are provided in the Abbreviations section. Likewise, the variable names used in this article correspond to their Spanish-named counterparts in the original EVA source data and in the published dataset: sown_area (area_sembrada), harvested_area (area_cosechada), production (produccion), yield (rendimiento), department (departamento), crop (cultivo), year (anio), period (periodo), and crop_cycle (ciclo_cultivo).

Intra-annual period analysis and exclusion of unclassified records.

Each EVA record corresponds to one of two semi-annual production periods (A or B). Analysis of this variable showed that 99.87% of records are correctly classified. The remaining 261 records (0.13%) carried a value equivalent to the full calendar year without specifying the intra-annual cycle, concentrated in aromatic plants (95%) and tobacco (5%) between 2019 and 2024. Since these records cannot be unambiguously assigned to either period A or B, they are excluded to preserve a homogeneous temporal structure. The dataset retains 194,589 records.

Annual record distribution, exclusion of 2006, and treatment of 2018.

The annual distribution of records is regular throughout the series, except for two years with atypical behavior. The year 2006 contains only 1215 observations—more than three standard deviations below the 2007–2024 annual mean of ≈10,800—indicating markedly partial source coverage incompatible with the construction of complete temporal series; it is therefore excluded [37]. The year 2018 contains only period A records due to the transition between the two EVA sources; it is retained, as its period A records are valid, and its atypical structure is noted in the interpretation of temporal continuity.

Zero-value analysis and imputation.

The numerical variables (sown_area, harvested_area, production, and yield) contain zero values in proportions ranging from 1.4% to 2.4% of the total records. Since some combinations of zeros are agronomically coherent (for example, positive sown_area with harvested_area, production, and yield equal to zero, interpretable as a failed sowing or total crop loss), while others constitute recording inconsistencies (for example, positive production with harvested_area equal to zero, which violates the fundamental agronomic relationship $\text{production} = \text{harvested\_area} \times \text{yield}$), a systematic analysis is conducted that classifies cases into three decision groups.

Table 2 presents the identified cases, their agronomic interpretation, and the action adopted for each.

For cases where sown_area was zero in an agronomically implausible manner (cases C2 and C5), imputation is grounded in the empirical relationship between the two surface variables, estimated through a simple linear regression model fitted on complete observations:

$$\text{sown\_area}_{i} = \alpha + \beta \cdot \text{harvested\_area}_{i} + \varepsilon_{i}, \tag{1}$$

where $\alpha$ is the intercept, $\beta$ is the slope of the relationship, and $\varepsilon_{i}$ is the error term. The model fitted to complete records yielded $\alpha \approx 28$ ha, $\beta \approx 0.87$, $R^{2} \approx 0.81$, and a Pearson correlation of $r \approx 0.90$. The small magnitude of the intercept relative to typical surface values, combined with the high observed correlation, supports the use of harvested_area as a proxy for sown_area in missing data records, constituting a conservative deterministic imputation that avoids systematic underestimation of planted area in departmental aggregates [36]. In total, the imputation process affected a very small fraction of records: sown_area was imputed in 2714 records (1.39%), harvested_area in 532 (0.27%), production in 112 (0.06%), and yield was adjusted in 356 cases (0.17%).
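The regression-based imputation of Equation (1) can be illustrated with a minimal sketch. It assumes that the records to be imputed are those with a zero sown_area and a positive harvested_area; the exact case definitions (C2, C5) belong to Table 2 and are not reproduced here. The function names and the toy values are hypothetical and are not taken from the EVA data.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (alpha, beta) of the least-squares line y = alpha + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta

def impute_sown_area(records, alpha, beta):
    """Replace a zero sown_area by the fitted value when harvested_area > 0 (assumed rule)."""
    out = []
    for rec in records:
        rec = dict(rec)  # copy so the input list is left untouched
        if rec["sown_area"] == 0 and rec["harvested_area"] > 0:
            rec["sown_area"] = alpha + beta * rec["harvested_area"]
        out.append(rec)
    return out

# Toy records in hectares; the values are invented for illustration only.
records = [
    {"sown_area": 110.0, "harvested_area": 100.0},
    {"sown_area": 210.0, "harvested_area": 200.0},
    {"sown_area": 310.0, "harvested_area": 300.0},
    {"sown_area": 0.0, "harvested_area": 150.0},  # implausible zero to impute
]
# The model is fitted on complete observations only.
complete = [r for r in records if r["sown_area"] > 0 and r["harvested_area"] > 0]
alpha, beta = fit_simple_ols(
    [r["harvested_area"] for r in complete],
    [r["sown_area"] for r in complete],
)
imputed = impute_sown_area(records, alpha, beta)
```

Fitting on complete records only keeps the implausible zeros from biasing the slope toward zero.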

2.2.2. Departmental Aggregation and Dataset Consolidation

Departmental aggregation.

The original dataset has municipal granularity, with individual records per municipality, crop, year, and period. Since the objective of the study is analysis at the departmental level, the aggregation is performed by summing sown_area, harvested_area, and production for each unique combination of department, crop, year, and period. The aggregated yield is recalculated as the ratio of total production to total harvested_area for the group, preserving agronomic identity $\text{production} = \text{harvested\_area} \times \text{yield}$. Rows appearing as “duplicates” after removing the municipal dimension are not eliminated prior to aggregation, as they correspond to records from different municipalities that contribute to the departmental total; their removal would underestimate departmental aggregates. The resulting dataset comprises 19,107 observations in 8 dimensions.
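The aggregation rule described above can be sketched as follows. The function name, the dictionary-based record layout, and the toy values are hypothetical; setting the yield to zero when nothing was harvested is an assumption of this sketch, since the text does not state how that case is handled.

```python
from collections import defaultdict

def aggregate_to_department(rows):
    """Sum areas and production per (department, crop, year, period)."""
    totals = defaultdict(
        lambda: {"sown_area": 0.0, "harvested_area": 0.0, "production": 0.0}
    )
    for row in rows:
        key = (row["department"], row["crop"], row["year"], row["period"])
        for field in ("sown_area", "harvested_area", "production"):
            totals[key][field] += row[field]
    result = {}
    for key, agg in totals.items():
        harvested = agg["harvested_area"]
        # Yield is recomputed from the totals, never averaged across municipalities.
        # Assumption: yield is set to 0.0 when the harvested area is zero.
        agg["yield"] = agg["production"] / harvested if harvested > 0 else 0.0
        result[key] = agg
    return result

# Toy municipal rows (invented values): two municipalities in the same department.
rows = [
    {"department": "boyacá", "crop": "papa", "year": 2020, "period": "A",
     "sown_area": 10.0, "harvested_area": 10.0, "production": 200.0},
    {"department": "boyacá", "crop": "papa", "year": 2020, "period": "A",
     "sown_area": 30.0, "harvested_area": 30.0, "production": 300.0},
    {"department": "tolima", "crop": "arroz", "year": 2020, "period": "A",
     "sown_area": 5.0, "harvested_area": 0.0, "production": 0.0},
]
departmental = aggregate_to_department(rows)
```

In the toy data the two municipal yields are 20 and 10 t/ha, whereas the ratio of totals is 500 / 40 = 12.5 t/ha, which illustrates why the yield is recalculated rather than averaged.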

Review of crops with special denomination.

Semantic inspection of the crop column identifies 14 categories that do not represent an individual crop with its own temporal dynamics: calabacín, calabacín/calabaza, calabaza, chiraran-albahaca, flores y follajes, hibias/ocas, hortalizas varias, malanga, malanga/achín/yota/papa china/bore, otras hortalizas, otras raíces y tubérculos, plantas aromáticas, tabaco, and tabaco rubio. For each, the number of records, departments, year coverage, and accumulated production are evaluated. The decisions adopted are:

Hortalizas varias and otras hortalizas are generic multi-species categories representing equivalent classifications across the two source periods; retained for descriptive analysis but excluded from clustering.
Tabaco and tabaco rubio are recoded forms of the same species across periods; unified under tabaco to preserve series continuity.
Plantas aromáticas appears exclusively in 2019–2024 without A/B period classification; excluded from temporal analysis.
Flores y follajes and hibias/ocas present very limited coverage (<5 departments, <10 periods) and are excluded.
Within the malanga and calabacín/calabaza families, aggregate denominations are excluded while those with specific denominations and sufficient coverage are retained.

Temporal continuity threshold.

The minimum threshold of 15 semi-annual periods per department–crop pair was selected based on a sensitivity analysis across candidate threshold values (Table 3). The analysis reveals two distinct behaviors: low thresholds (≤12 periods) retain 67–71% of pairs but cover nearly all production (99.7–99.9%), indicating the included pairs are predominantly marginal producers; above 15 periods, production coverage stabilizes around 90%, while pair coverage declines steadily. The threshold of 15 periods represents the equilibrium point: it retains 52.9% of pairs (513), keeps production coverage at 90.65%, guarantees participation of all 32 departments and 61 unique crops, and covers 95% of the top 20 most representative crops nationally. The next candidate threshold (18 periods) already falls below 50% pair coverage (49.2%) while yielding no meaningful gain in production coverage (90.46%). This threshold thus represents an explicit trade-off between the temporal quality of the computed features and the sample coverage required to preserve the structural view of the Colombian transitory crop system at the national level.

Applying the selected threshold reduces the dataset from 18,716 to 15,539 observations. After applying the decisions on specially denominated crops (removal of 5 categories, 462 records, 23 pairs, and 635,302 t—0.34% of accumulated production), the final dataset comprises 15,077 observations, 490 department–crop pairs, 56 unique crops, 32 departments, and a temporal coverage of 2007–2024, with a minimum of 15 periods per pair, a median of 35, and a maximum of 35.
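A minimal sketch of the temporal continuity filter is given below. It assumes that continuity is measured as the number of distinct (year, period) combinations observed for each department–crop pair; the function name and the toy observations are hypothetical.

```python
from collections import defaultdict

def pairs_meeting_threshold(observations, min_periods=15):
    """Return the department-crop pairs observed in at least min_periods distinct semesters."""
    periods = defaultdict(set)
    for department, crop, year, period in observations:
        periods[(department, crop)].add((year, period))
    return {pair for pair, seen in periods.items() if len(seen) >= min_periods}

# Toy observations (invented): one pair with 16 semesters, one with only 4.
observations = [("tolima", "arroz", y, p) for y in range(2007, 2015) for p in "AB"]
observations += [("vaupés", "maíz", y, "A") for y in range(2007, 2011)]

retained = pairs_meeting_threshold(observations, min_periods=15)
```

Re-running the same function over a grid of candidate values of `min_periods` would reproduce the kind of sensitivity analysis summarized in Table 3.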

2.3. Feature Engineering and Variable Selection

For each of the 490 department–crop pairs, a set of 36 features is computed to characterize their productive behavior over the temporal series. These variables are organized into five conceptual dimensions: production magnitude (level statistics on production and area), trend (direction and magnitude of long-term change), volatility (intra- and inter-annual variability of production), seasonality (asymmetry between periods A and B) and continuity and concentration (regularity of presence in the series and productive concentration indices). The magnitude, trend, and continuity variables are calculated using standard descriptive statistics and time series analysis procedures [38].

Concentration indices are computed using the Herfindahl–Hirschman index (HHI) [39], defined as:

$$\mathrm{HHI} = \sum_{i=1}^{n} s_{i}^{2}, \tag{2}$$

where $s_{i}$ denotes the relative share of period $i$ in the total accumulated production (temporal HHI) or the share of department $i$ in the national production of the crop (spatial HHI). The index takes values in (0, 1], where values close to 1 indicate a high concentration, and values close to 0 reflect a homogeneous distribution.
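As an illustration of Equation (2), the following sketch computes the index from raw production values by first converting them to shares. The function name and the toy values are hypothetical.

```python
def hhi(values):
    """Herfindahl-Hirschman index of non-negative values with a positive sum."""
    total = sum(values)
    if total <= 0:
        raise ValueError("HHI is undefined when the total is not positive")
    return sum((v / total) ** 2 for v in values)

# Four periods with identical production: shares of 0.25 each.
even = hhi([100.0, 100.0, 100.0, 100.0])
# One period holding 97% of the accumulated production.
concentrated = hhi([970.0, 10.0, 10.0, 10.0])
```

With $n$ equal shares the index equals $1/n$, so a pair observed evenly over 35 periods would have a temporal HHI of about 0.029, while a single dominant period pushes the value toward 1.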

Variable selection for productive scale clustering.

In this study, the term productive scale refers strictly to the relative magnitude of the average production volume (production, in tonnes) of a department–crop pair over the study period. This operational definition should not be conflated with the economic concepts of returns to scale, factor productivity, or technical efficiency, which pertain to the relationship between inputs and outputs in production functions. Equally, it does not refer to the size of individual productive units or farms. The segmentation produced here classifies department–crop pairs by their observed output magnitude and structural characteristics, providing an empirical categorization that is useful for descriptive and policy purposes without making claims about the underlying economic mechanisms that determine those output levels. Of the 36 computed features, nine candidate variables are initially identified for the scale clustering, grouped into three conceptual categories: productive magnitude (total production, average production, average sown_area, average harvested_area), temporal continuity (number of periods, duration in years, percentage of completeness) and concentration (temporal HHI, spatial HHI). Analysis of the correlation matrix among these nine variables (Figure 3) reveals significant redundancies: near-perfect correlations ($|r| > 0.99$) are identified between total and average production and between average sown_area and average harvested_area; additionally, high correlations ($|r| > 0.74$) are observed between production and area variables. Within the temporal continuity group, the number of periods and the completeness percentage present a correlation of 0.78.

To avoid over-weighting redundant dimensions, one representative variable is selected per highly correlated pair: average production over total production (as it is independent of series length), average sown_area over average harvested_area (as it captures productive intent), and number of periods over completeness percentage (as a direct continuity indicator). The temporal and spatial HHI variables are retained given their low correlations with the remaining variables ($|r| < 0.42$), indicating that they contribute complementary information not captured by the magnitude variables. The final set of five variables for the clustering model is: average production (t), average sown_area (ha), number of observed periods, temporal HHI and spatial HHI.
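The redundancy screening can be sketched with a plain Pearson correlation and a threshold on its absolute value. This is an illustrative sketch only: the function names, the feature names, and the toy values are hypothetical, and the choice of which member of a flagged pair to keep remains a substantive decision, as argued in the text.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def redundant_pairs(features, threshold=0.99):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    names = sorted(features)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) > threshold:
                flagged.append((a, b, r))
    return flagged

# Toy feature table (invented): total production is proportional to the average.
features = {
    "avg_production": [10.0, 200.0, 3000.0, 45000.0],
    "total_production": [300.0, 6000.0, 90000.0, 1350000.0],
    "temporal_hhi": [0.05, 0.04, 0.06, 0.03],
}
flagged = redundant_pairs(features)
```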

Principal component analysis (PCA) as a methodological diagnostic.

To verify the variance structure of the selected variable set and inform the decision on data transformation prior to clustering, a principal component analysis is applied. The results show that the first component (PC1) explains 99.28% of the total variance, with a dominant loading of the average production (0.995), while the remaining components capture marginal fractions (PC2: 0.72%; PC3–PC5: 0.00%). This finding confirms that the variance structure of the dataset is completely dominated by the productive scale, supporting the decision to work with the data on their original scale, as detailed in Section 2.4.2.

2.4. Productive Scale Clustering

2.4.1. Algorithm Selection

The 490 department–crop pairs are grouped by productive scale using the K-Means algorithm [40], a partitional clustering method widely used in agricultural analytics for its computational efficiency, interpretability of the resulting centroids, and solid performance in datasets with compact group structures [12,14,16,36].

The algorithm proceeds as follows: given a predefined number of clusters $k$, $k$ centroids are initialized in the feature space, either randomly or through improved methods such as K-Means++ to reduce the sensitivity to initialization. In the assignment step, each observation $x_{i}$ is assigned to the cluster $C_{j^{*}}$ whose centroid $\mu_{j^{*}}$ minimizes the squared Euclidean distance:

$$j^{*} = \arg\min_{j \in \{1, \dots, k\}} \lVert x_{i} - \mu_{j} \rVert^{2}. \tag{3}$$

In the update step, each centroid is recomputed as the mean of all observations assigned to its cluster:

$$\mu_{j} = \frac{1}{|C_{j}|} \sum_{x_{i} \in C_{j}} x_{i}. \tag{4}$$

These two steps are repeated iteratively until the assignments no longer change between iterations (convergence) or a maximum number of iterations is reached. The global optimization criterion of the algorithm is to minimize total inertia (WCSS), described in Section 2.4.3. To ensure solution stability against random initialization, multiple runs with different random seeds are executed, and the run yielding the lowest WCSS is retained [41].
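The assignment and update steps, together with the multi-seed strategy, can be sketched in plain Python. This is not the implementation used in the study: it uses simple random initialization from the data rather than K-Means++, and the function names, the number of runs, and the toy points are assumptions made for illustration.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, k, seed, max_iter=100):
    """One K-Means run: alternate assignment and update until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization (not K-Means++)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

def kmeans(points, k, n_runs=10):
    """Run several seeds and keep the solution with the lowest WCSS."""
    return min((lloyd(points, k, seed) for seed in range(n_runs)), key=lambda run: run[2])

# Toy points (invented): three well-separated groups on the original scale.
points = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5),
          (50.0, 40.0), (52.0, 41.0),
          (300.0, 200.0), (310.0, 205.0)]
labels, centroids, wcss = kmeans(points, k=3)
```

Individual runs may stop at a local optimum, for instance when two initial centroids fall in the same group; retaining the run with the lowest WCSS is what guards against that.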

2.4.2. Data Treatment

Unlike the common practice of standardizing or transforming variables in logarithmic terms prior to clustering, this study works with data on their original scale. This decision is grounded in the recognition that the high asymmetry of the production distribution is not a statistical artifact to be corrected, but a structural characteristic of the Colombian agricultural system that the clustering algorithm is designed to capture. An empirical comparison among three treatment schemes (original data, logarithmic transformation with log1p, and standardization with StandardScaler) confirmed this criterion: with original data, K-Means identifies only 5 pairs as large scale (1.0% of the total), reflecting the actual concentration of production; with logarithmic transformation, the large scale group expands artificially to 132 pairs (26.9%), diluting the conceptual meaning of the category. The PCA diagnostic from Section 2.3 further supports this decision: compressing the variance dominated by average production through transformation is equivalent to removing the structural signal the algorithm is designed to detect. This decision is empirically validated in Section 3.2.1, where a formal sensitivity analysis demonstrates that both log1p transformation and standardization produce Adjusted Rand Index (ARI) values below 0.08 relative to the original solution, with cluster size distributions that diverge substantially from the unscaled result, confirming that normalization fundamentally alters the segmentation structure rather than merely rescaling it.
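The Adjusted Rand Index used to compare the treatment schemes can be computed from pair counts. The sketch below uses the standard pair-counting formula; the function name and the toy label vectors are hypothetical and do not reproduce the study's results.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(labels_a)
    # Pairs of items placed together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Same partition with swapped label names: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that never group the same items together.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is invariant to label names, it compares the grouping itself; values near zero indicate agreement no better than chance.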

2.4.3. Determination of the Optimal Number of Clusters

The optimal number of clusters k is selected through the joint application of two complementary criteria.

The first is the elbow method [41], based on the analysis of within-cluster inertia (Within-Cluster Sum of Squares, WCSS):

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_{i} \in C_{k}} \lVert x_{i} - \mu_{k} \rVert^{2}, \tag{5}$$

where $K$ is the number of clusters, $C_{k}$ is the set of observations assigned to cluster $k$, $x_{i}$ is the feature vector of observation $i$, and $\mu_{k}$ is the centroid of cluster $k$. As the number of clusters increases, the inertia decreases monotonically. The criterion consists of identifying the inflection point where the rate of inertia reduction decreases significantly, forming an “elbow” in the curve that represents the optimal trade-off between cluster compactness and model parsimony [41].
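Equation (5) can be evaluated directly for any labeled partition. In the sketch below the partitions are specified by hand for a toy one-dimensional dataset, rather than produced by K-Means, purely to show how the inertia falls as clusters are added; the function name and the values are hypothetical.

```python
def wcss(points, labels):
    """Within-cluster sum of squared Euclidean distances to each cluster mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(c) / len(members) for c in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in members)
    return total

# Toy one-dimensional data (invented) with three visible groups.
points = [(1.0,), (3.0,), (10.0,), (12.0,), (50.0,), (54.0,)]
inertia = {
    1: wcss(points, [0, 0, 0, 0, 0, 0]),
    2: wcss(points, [0, 0, 0, 0, 1, 1]),
    3: wcss(points, [0, 0, 1, 1, 2, 2]),
}
```

In this toy case the inertia drops sharply from one to two clusters and again from two to three; where the elbow lies is read from the full curve over a wider range of cluster counts.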

The second criterion is the silhouette score, proposed by Rousseeuw [42], which simultaneously evaluates the internal cohesion of each cluster and its separation from the others. For each observation i:

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},

(6)

where a(i) is the mean distance between observation i and all other observations in its cluster (internal cohesion), and b(i) is the mean distance between i and the observations of the nearest neighboring cluster, that is, the minimum of the mean distances to each of the other clusters (external separation). The global coefficient is obtained as:

\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i),

(7)

and takes values in [−1, 1], where values close to 1 indicate a well-defined segmentation, values near 0 signal observations on the boundary between clusters, and negative values suggest possible misassignments [42]. The results of applying both criteria to the study data are presented in Section 3.2.1.
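Equations (6) and (7) can be sketched as follows. This is a plain-Python illustration with Euclidean distance and hypothetical toy data, not the implementation used in the study; it assumes at least two clusters and assigns s(i) = 0 to an observation that is alone in its cluster, a common convention.

```python
from math import dist

def silhouette_values(points, labels):
    """Per-observation silhouette s(i), following Equation (6)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # singleton cluster: cohesion is undefined
            continue
        # a(i): mean distance to the other members of the same cluster.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        values.append((b - a) / max(a, b))
    return values

# Hypothetical toy data: two tight groups far apart.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]
s_values = silhouette_values(points, labels)
global_silhouette = sum(s_values) / len(s_values)  # Equation (7)
print(round(global_silhouette, 3))
```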

2.4.4. Model Validation

The robustness of the segmentation is evaluated by running multiple model configurations, incrementally varying the number of included variables:

1. Average production only;
2. Average production and average sown_area;
3. Adding number of periods;
4. Adding temporal HHI;
5. The full model with all five variables.

The consistency of assignments across configurations allows application of the parsimony principle and supports the selection of the final model as the specification that maximizes interpretive richness without compromising clustering quality.
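The nested configurations listed above can be sketched as prefixes of an ordered feature list. The column names are hypothetical, and the agreement measure is only one simple way of comparing assignments; it assumes both configurations use the same scale labels.

```python
# Hypothetical column names for the five clustering features, in the
# order in which they enter the incremental configurations.
FEATURES = [
    "avg_production",
    "avg_sown_area",
    "n_periods",
    "hhi_temporal",
    "hhi_spatial",
]

# Configuration n uses the first n features, mirroring the numbered list.
configurations = [FEATURES[:n] for n in range(1, len(FEATURES) + 1)]

def agreement(labels_a, labels_b):
    """Share of pairs that receive the same scale label in two configurations."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

for number, columns in enumerate(configurations, start=1):
    print(number, columns)
```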

2.5. Baseline Comparison: Quartile-Based Classification

As a reference to evaluate the suitability of the clustering approach, an alternative classification is constructed based on quartiles of the distribution of total accumulated production during the 2007–2024 period [43]. The quartiles divide the distribution into four equal-size groups using the 25th, 50th, and 75th percentiles as cut-off points, defining the categories Q1 (production \leq Q_{1}), Q2 (Q_{1} < production \leq Q_{2}), Q3 (Q_{2} < production \leq Q_{3}), and Q4 (production > Q_{3}).

Among the 490 pairs in the final dataset, the calculated values are Q_{1} = 139.86 t, Q_{2} = 776.47 t, and Q_{3} = 4056.90 t, yielding a nearly uniform distribution of 123, 122, 122 and 123 pairs per category, respectively. This classification serves as a baseline against which to contrast the clusters obtained through K-Means, evaluate within-category homogeneity through the coefficient of variation, and demonstrate the limitations of a fixed-threshold segmentation approach relative to one grounded in the natural structure of the data.
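The category definitions of this section can be sketched with the reported cut-offs. The function name is an assumption; the boundary handling follows the inequalities stated above, so a value equal to a cut-off falls in the lower category.

```python
from bisect import bisect_left

# Cut-off points in tonnes, as reported in the text: Q1, Q2, Q3.
CUTS = [139.86, 776.47, 4056.90]
CATEGORIES = ["Q1", "Q2", "Q3", "Q4"]

def quartile_category(production_t):
    """Return the quartile category of a production value in tonnes."""
    # bisect_left counts the cut-offs strictly below the value, so a value
    # equal to a cut-off stays in the lower category (production <= Q_j).
    return CATEGORIES[bisect_left(CUTS, production_t)]

for value in (50.0, 139.86, 140.0, 4056.90, 709_923.0):
    print(value, quartile_category(value))
```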

3. Results

3.1. Exploratory Data Analysis

The exploratory data analysis in this section is conducted on the aggregated departmental dataset for the period 2007–2024, prior to applying the temporal continuity threshold and the removal of specially designated crops described in Section 2.2.2. This dataset comprises 18,716 observations, 969 department–crop pairs, 96 crops, and 32 departments, with a total cumulative production of 206,357,108 t. The purpose of using this broader dataset is to characterize the Colombian transitory crop system as comprehensively as possible, incorporating the full diversity of department–crop combinations registered in the EVA during the study period.

Table 4 summarizes the global descriptive statistics of this dataset.

3.1.1. Overview of the Final Dataset

Figure 4 shows the annual distribution of records in the aggregated dataset for the period 2006–2024. The series exhibits a relatively stable pattern from 2006 onward, with moderate inter-annual variation and no evident structural breaks. The year 2018 constitutes an exception: it contains only observations from period A, without period B data, a situation associated with the transition between the two EVA sources described in Section 2.2.1. Since the period A records for 2018 are valid and contribute to the continuity of the series, the year is retained in the dataset; its particularity is reflected in the annual record count for that year and is taken into account in the interpretation of temporal continuity analyses. This behavior confirms the suitability of the 2006–2024 window as the analytical period.

Figure 5 and Figure 6 present the univariate distributions of the four productive variables in linear and logarithmic scale. Sown_area exhibits a highly asymmetric distribution (skewness = 10.66) spanning six orders of magnitude, from 0.01 to 207,751 ha; in logarithmic scale, a bimodal pattern is revealed, with concentrations around 1 ha (10⁰) and 100 ha (10²), associated with subsistence agriculture and commercial smallholding systems, respectively. Harvested_area shows similar distributional characteristics but with even more pronounced skewness (11.47) and extreme kurtosis (219.08), with a median of 78 ha versus a mean of 1056 ha; 166 observations (0.9%) present a zero value, corresponding to failed-sowing records identified during Phase 1 cleaning. Production exhibits a pronounced divergence between its median (577 t) and mean (11,018 t), with a mean-to-median ratio of approximately 19; approximately 91% of records fall below 50,000 t, while a small number of observations concentrates volumes exceeding one million tonnes; in logarithmic scale, a unimodal structure is observed with the main concentration around 1000 t, differentiated from the bimodal pattern of the area variables. Yield, calculated as the ratio of production to harvested_area (t/ha), ranges from 0 to 102.15 t/ha with a median of 7 t/ha and a mean of 10.50 t/ha; in logarithmic scale it reveals an irregular multimodal structure with concentrations around 1, 10, and 20 t/ha, reflecting the agronomic diversity of the 96 crops analyzed. This structural heterogeneity across all variables is consistent with the nature of the Colombian transitory crop system and has direct implications for the preprocessing and modeling decisions described in Section 2.

3.1.2. Analysis by Department–Crop Pair

Figure 7 presents the 10 department–crop pairs with the highest cumulative production, the highest average planted area, the highest average harvested area, and the highest average yield for the period 2007–2024. In the first three indicators (Figure 7a–c), a strong concentration is observed in combinations led by papa in Cundinamarca, Boyacá, and Nariño, and arroz in Casanare, Tolima, and Meta, which anticipates the results of the productive scale segmentation. The average yield (Figure 7d), in contrast, shows a different profile dominated by high-value vegetables and specialty crops such as tomate in Boyacá, Antioquia, and Nariño, as well as cebolla de rama and various leafy vegetables, demonstrating that productive efficiency is not necessarily associated with the highest production volumes.

Temporal continuity analysis identifies pairs with systematic presence throughout the historical series. Figure 8 shows the continuity distribution under current dataset conditions (Figure 8a) and under a theoretical scenario assuming future imputation of period 2018B (Figure 8b). Under current conditions, 252 pairs (26.01% of the total 969 pairs) are present in 100% of the 36 periods observed. In the theoretical scenario, with the series starting in 2007A and covering 35 complete periods, the number of pairs with perfect continuity increases to 255 (26.32%). The frequency distribution of active periods per pair reveals that a substantial fraction of the 969 pairs present short or discontinuous series, which underlies the need for the continuity threshold applied in the dataset preparation phase for clustering.

Table 5 summarizes this distribution by range.

3.1.3. Analysis by Crop

Figure 9 shows the most and least frequent crops in the national territory, measured by the number of department–crop pairs. Yuca and maíz are the most widespread species, registered in all 32 departments, followed by widely consumed crops such as ahuyama, tomate, cilantro, and frijol. In contrast, crops such as alcachofa, caléndula, champiñón, amaranto, and various aromatic herbs are registered in only one or two departments, reflecting their specialized or geographically restricted nature.

Figure 10 presents the 20 crops with the highest total cumulative production for the period 2007–2024. Papa contributes 29.3% of national transitory crop production, followed by arroz (25.9%) and maíz (12.3%). Together, these three crops account for more than 67% of sector output, while the majority of the 96 crops analyzed make marginal contributions to the national total.

Panel (a) of Figure 11 presents the spatial coverage of transitory crops, expressed as the number of departments in which each crop is registered as a producer. Yuca and maíz are present in all 32 departments, followed by crops such as ahuyama, tomate, cilantro and frijol with coverage above 28 departments. Panel (b) shows the full distribution of all 96 crops by number of producing departments, revealing a markedly asymmetric structure: a small group of crops achieves broad national coverage, while the majority is present in fewer than ten departments. On average, each crop is registered in 10.1 departments, with a high spread throughout the distribution.

Figure 12 presents the spatial concentration measured by the Herfindahl–Hirschman Index (HHI) for the crops with the highest and lowest values. Panel (a) shows the 20 most concentrated crops, several of which approach an HHI of 1.0, indicating that their production is virtually monopolized by a single department; this pattern is characteristic of crops with highly specific agro-ecological requirements or strong regional productive traditions. Panel (b) shows the 20 least concentrated crops, including yuca, ahuyama, maíz, frijol, and tomate, all of which have HHI values below 0.20, reflecting a more homogeneous distribution of production throughout the national territory. Taken together, both panels confirm that spatial coverage and spatial concentration are not equivalent dimensions: a crop can be widely distributed across departments yet still have its production dominated by one or a few of them, while another may be present in only a few departments but with a balanced productive distribution among them.
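The distinction between coverage and concentration can be illustrated with a short sketch. It assumes the usual definition of the HHI as the sum of squared shares on a 0 to 1 scale, which matches the range of values discussed above; the production figures are hypothetical.

```python
def hhi(quantities):
    """Herfindahl-Hirschman Index: sum of squared shares, between 1/n and 1."""
    total = sum(quantities)
    if total <= 0:
        raise ValueError("total must be positive")
    return sum((q / total) ** 2 for q in quantities)

# Hypothetical departmental production (t) for two crops.
wide_but_dominated = [910] + [10] * 9   # 10 departments, one dominant
narrow_but_balanced = [100, 100, 100]   # 3 departments, equal shares

print(round(hhi(wide_but_dominated), 3))
print(round(hhi(narrow_but_balanced), 3))
```

The first crop is present in more departments but has the higher index, because one department holds 91% of its production.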

3.1.4. Spatial Analysis

Figure 13 presents the 20 departments with the highest total cumulative production for the period 2007–2024. Cundinamarca concentrates approximately 16.0% of national production, followed by Boyacá (11.7%), Tolima (8.8%), Meta (8.0%), and Nariño (7.8%). These five departments together account for more than 52% of national production during the period, confirming that transitory crop output is organized around a few regional poles, particularly in the Andean region and the Eastern Plains, while several departments in the Caribbean and Amazon regions present comparatively lower contributions.

Figure 14 summarizes crop diversity by department. Cundinamarca leads with 66 distinct registered crops (68.8% of the 96 crops in the dataset), followed by Nariño (61 crops) and Boyacá (59 crops). At the lower end, Guainía records only 5 crops, and other sparsely populated or geographically isolated departments show similarly limited productive portfolios. This territorial heterogeneity in productive diversity suggests that agricultural specialization varies markedly between regions of the country.

Figure 15a,b present the department–crop coverage matrix as a presence/absence heat map for the full dataset (32 departments, 96 crops, 3072 possible pairs; observed density of 31.5%), divided into two panels by spatial coverage. Panel (a) displays the crops with the widest territorial distribution, while panel (b) shows those with a narrower coverage. Together, the two panels reveal that matrix density is markedly heterogeneous across crops: a small group with a broad national presence generates dense columns, whereas the majority of crops are registered in only a few departments, producing sparse patterns that underline the structural diversity of the Colombian transitory crop system. Figure 16 shows the Colombian map with transitory crop diversity by department, visually reinforcing these patterns and allowing identification of regional contrasts in productive richness throughout the national territory.

3.1.5. Temporal Analysis

Figure 17 presents the evolution of national total production between 2007 and 2024. Panel (a) shows the annual series with its linear trend, ranging from a minimum of 6.99 million tons in 2018 to a maximum of 16.09 million tons in 2020, with a mean of 11.46 million tons and a positive long-term trend. Panel (b) shows the annual percentage variation, where the most notable behaviors are a strong growth of approximately 19.5% in 2016, pronounced downturns in 2017 and 2021, and a sharp drop in 2018 associated with the partial coverage of that year described in Section 2.2.1.

Figure 18 presents the evolution of the total national planted area. Panel (a) shows the annual series with its linear trend, ranging from 1.09 to 2.24 million hectares; the minimum recorded in 2018 reflects the partial source coverage of that year, and the resulting drop of approximately 41% relative to 2017 constitutes the most extreme inter-annual behavior in the series. Panel (b) confirms that, outside this anomalous observation, the inter-annual variation in the planted area remains within a moderate range, with a long-term positive trend consistent with the expansion of department–crop combinations registered in the EVA.

Figure 19 presents the evolution of the number of active department–crop pairs. Panel (a) shows a clear upward trend, rising from 486 pairs in 2007 to 705 in 2024, a cumulative increase of approximately 45% that reflects the progressive incorporation of new department–crop combinations into the EVA registry throughout the study period. Panel (b) shows the year-on-year variation, which is predominantly positive and with moderate amplitude, reinforcing the interpretation of a gradual and sustained expansion of the registered productive universe rather than abrupt structural changes.

Figure 20 compares the distribution of the four productive variables between periods A and B through box plots. The four panels reveal that period A consistently presents higher central values and wider interquartile ranges than period B across all variables, reflecting a higher productive volume in the first semester. To assess whether these differences are statistically significant, the non-parametric Mann–Whitney U test for independent samples is applied to each variable. The results do not indicate systematic differences between periods: sown_area (U = 43,954,785.5; p = 0.4021), harvested_area (U = 43,753,198.0; p = 0.7706), production (U = 43,891,879.5; p = 0.5045), and yield (U = 43,850,121.0; p = 0.5794). Since all p-values exceed the significance level α = 0.05, the null hypothesis of equal medians across periods is not rejected for any variable.
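For readers unfamiliar with the statistic, the sketch below computes a Mann–Whitney U value by direct pairwise comparison. It is an expository, quadratic-time version on hypothetical toy samples, not the routine used in the study, and it omits the p-value calculation.

```python
def mann_whitney_u(sample_x, sample_y):
    """U statistic of sample_x: each pair with x > y counts 1, each tie 0.5."""
    u = 0.0
    for x in sample_x:
        for y in sample_y:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Hypothetical toy samples (not EVA data).
period_a = [1.0, 2.0, 4.0]
period_b = [2.0, 3.0]

u_a = mann_whitney_u(period_a, period_b)
u_b = mann_whitney_u(period_b, period_a)

# The two statistics always add up to the number of cross-sample pairs.
print(u_a, u_b, len(period_a) * len(period_b))
```

Because each tie contributes 0.5, half-integer values of U, such as two of those reported above, arise only when the samples contain tied observations.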

Figure 21 presents the cumulative totals for the four productive variables disaggregated by period for the complete 2007–2024 series. Period A accounts for a larger planted area (19.27 million ha versus 13.47 million ha in B), a larger harvested area (16.91 vs. 13.45 million ha), and a higher total production (115.54 vs. 90.80 million t). These differences in cumulative totals reflect coverage and record-count factors (period A systematically registers more active department–crop pairs per year) rather than a structural asymmetry in the productive dynamics between semesters, as confirmed by the Mann–Whitney U test results reported above.

3.1.6. Correlation Analysis

Figure 22 presents the Pearson correlation matrix between sown_area, harvested_area, production, and yield. The correlations between the first three variables are strong and statistically significant (p < 0.001): the correlation between sown_area and harvested_area is the highest in the set, followed by the correlations of both area variables with production. Yield, in contrast, presents very weak correlations with the area variables (coefficients close to zero, slightly negative); Spearman rank-based analysis confirms a moderate correlation between yield and production (ρ \approx 0.35), supporting its interpretation as a scale-independent productive efficiency indicator.

Figure 23 presents the linear regression analysis between harvested_area and production. Residual analysis (Figure 23a,b) reveals pronounced heteroscedasticity and non-normality of residuals, confirmed by the Shapiro–Wilk test (statistic = 0.3844; p < 0.000001) and the D’Agostino–Pearson test (statistic = 29,028.29; p < 0.000001), suggesting that the area–production relationship is not uniformly linear across all productive scale ranges. The model fitted to 18,550 observations (Figure 23c) is the following:

\widehat{\mathrm{production}}_{i} = 1250 + 6.03 \cdot \mathrm{harvested\_area}_{i},

(8)

with R^{2} = 0.595, r = 0.771, and p < 0.001. The coefficient indicates that, on average, each additional hectare harvested is associated with an increase of 6.03 tons in production.

Figure 24 presents the linear regression analysis between sown_area and harvested_area. Residual analysis (Figure 24a,b) reveals pronounced heteroscedasticity consistent with the scale structure of the data, with residuals concentrated near zero for the vast majority of observations and a small number of high-leverage points at large area values; the residual distribution confirms strong leptokurtosis, indicating non-normality driven by extreme cases rather than by a systematic bias in the fitted model. The model fitted to 18,550 observations (Figure 24c) is the following:

\widehat{\mathrm{harvested\_area}}_{i} = 105.61 + 0.868 \cdot \mathrm{sown\_area}_{i},

(9)

with R^{2} = 0.829 and r = 0.910. The mean harvested_area/sown_area ratio is 0.9479 (median: 0.9731), indicating that under typical conditions between 95% and 97% of the planted area is harvested, a value consistent with the loss rates observed in Colombian transitory crops and supporting the imputation approach applied in Section 2.2.1.
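The two fitted lines can be evaluated directly, and the reported fit statistics can be cross-checked, since in a least-squares regression with a single predictor R^2 equals the square of the Pearson coefficient r. The function names and the 1000 ha inputs are illustrative assumptions.

```python
def predict_production(harvested_area_ha):
    """Equation (8): fitted production (t) from harvested area (ha)."""
    return 1250 + 6.03 * harvested_area_ha

def predict_harvested_area(sown_area_ha):
    """Equation (9): fitted harvested area (ha) from sown area (ha)."""
    return 105.61 + 0.868 * sown_area_ha

# Reported (r, R^2) for each model; r squared should match R^2 up to rounding.
reported = {"Equation (8)": (0.771, 0.595), "Equation (9)": (0.910, 0.829)}
for name, (r, r_squared) in reported.items():
    print(f"{name}: r^2 = {r ** 2:.3f}, reported R^2 = {r_squared:.3f}")

example_production = predict_production(1000.0)     # hypothetical 1000 ha harvested
example_harvested = predict_harvested_area(1000.0)  # hypothetical 1000 ha sown
print(example_production, example_harvested)
```

Given the heteroscedasticity noted above, such point predictions are best read as averages over the whole scale range rather than as reliable estimates for any individual pair.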

The exploratory analysis presented in the preceding sections characterizes the complete structure of the Colombian transitory crop system over the period 2007–2024, considering the broadest available set of department–crop combinations in the EVA. However, for the productive scale segmentation stage, including pairs with short or discontinuous time series would compromise the quality and interpretability of the calculated features, particularly along the dimensions of trend, volatility, and seasonality. For this reason, the temporal continuity threshold (≥15 semi-annual periods with positive production) and the removal of crops with special denomination are applied as described in Section 2.2.2, reducing the dataset from 969 to 490 pairs, from 96 to 56 crops, and from 18,716 to 15,077 records, with coverage of 90.65% of total cumulative production. This final dataset, which concentrates pairs with sufficient temporal depth for robust feature characterization, constitutes the basis for the productive scale clustering described in Section 3.2 and is publicly available in the Zenodo repository under a CC BY 4.0 license [44].

3.2. Productive Scale Segmentation

3.2.1. Optimal Number of Clusters

Figure 25 presents the results of the elbow method and the silhouette coefficient for values of k between 2 and 6. The elbow method shows a pronounced reduction in inertia between k = 2 and k = 4, with progressive stabilization from k = 3 onward. The silhouette coefficient presents exceptionally high values across all evaluated values of k, reaching its maximum at k = 2 and decreasing moderately to k = 3, with a more abrupt drop for k \geq 4.

The selection of k = 3 is supported by three complementary criteria. First, semantic interpretability: three clusters yield a classification that is directly interpretable as small, medium, and large scale, aligned with the categories used in agricultural literature and in Colombian sectoral policy. Second, statistical viability: with k = 4, the upper cluster contains only one pair, producing an individual silhouette coefficient of zero due to the absence of neighboring observations in the same group, which invalidates the evaluation of internal cohesion. Third, balance between granularity and robustness: although k = 2 yields the highest silhouette score, a binary segmentation is too coarse to capture the diversity of productive scales present in the sector. The robustness of this selection is further supported by the incremental variable sensitivity analysis described in Section 2.4.4, which confirms that cluster assignments remain stable across five progressively enriched model configurations, from a single-variable specification to the full five-variable model.

Robustness Analysis

To assess the robustness of the K-Means solution, two complementary analyses were conducted. First, an alternative clustering method (agglomerative hierarchical clustering with Ward linkage, k = 3) was applied to the same feature space. Second, the sensitivity of cluster assignments to the scaling scheme was evaluated by comparing the original (unscaled) solution against logarithmic transformation (log1p) and standardization (StandardScaler). Table 6 presents the results. The hierarchical clustering solution achieves a silhouette coefficient of 0.865, compared to 0.888 for K-Means, and identifies the exact same five large-scale pairs (Cundinamarca–papa, Boyacá–papa, Nariño–papa, Casanare–arroz, and Tolima–arroz). The Adjusted Rand Index (ARI) between the two methods is 0.728, indicating substantial agreement. The differences between solutions are confined to borderline pairs at the small/medium scale boundary, confirming that the core productive scale structure is method-invariant. Regarding scaling sensitivity, both log1p transformation and standardization produce ARI values below 0.08 relative to the original solution, with substantially different cluster size distributions. This confirms that normalization fundamentally alters the segmentation by compressing the structural asymmetry of the production distribution, which is precisely the feature the clustering is designed to capture. A Gaussian Mixture Model (GMM) was also evaluated but produced a degenerate solution (silhouette = −0.073) due to the violation of the multivariate normality assumption inherent in the highly asymmetric production distribution, confirming that K-Means is the more appropriate algorithm for this data structure.
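The Adjusted Rand Index used in these comparisons can be sketched from its standard pair-counting definition. The function below is a plain-Python illustration with hypothetical toy labels, not the routine used to produce Table 6.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two partitions of the same observations."""
    n = len(labels_a)
    # Pairs of observations grouped together in both partitions.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each partition separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (together_both - expected) / (maximum - expected)

# Hypothetical toy labels: same grouping with swapped names, then a crossed grouping.
same_structure = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_structure, crossed)
```

The index is invariant to how clusters are named, equals 1 for identical partitions, and is close to 0 when agreement is no better than chance, which is how values below 0.08 for the scaled variants should be read.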

3.2.2. Cluster Characterization

The K-Means model with k = 3 achieved a global silhouette coefficient of 0.888, indicating a well-defined cluster structure. Table 7 presents the silhouette statistics by cluster.

Only 3 pairs (0.6%) present negative silhouette coefficients, all belonging to the medium-scale cluster, suggesting borderline cases between the medium and small categories without compromising the overall validity of the segmentation. Figure 26 presents the silhouette plot by cluster (Figure 26a) and the two-dimensional PCA projection with color-coded clusters (Figure 26b), confirming the natural separation of groups along PC1, which concentrates 99.3% of the explained variance and is dominated almost exclusively by average production (loading = 0.995).

The segmentation distinguishes three categories with clearly differentiated productive characteristics, with natural breakpoints at approximately 35,386 t and 275,959 t of average production: the small scale groups 459 pairs (93.7%) with average production below 35,386 t; the medium scale comprises 26 pairs (5.3%) with average production between 35,386 and 275,959 t; and the large scale concentrates only 5 pairs (1.0%) with average production above 275,959 t. This markedly asymmetric distribution reflects a structural feature of the Colombian transitory crop system: the vast majority of department–crop combinations operate at output levels that are one to two orders of magnitude below those of the dominant pairs, suggesting that high-volume production is concentrated in a small number of specialized combinations rather than distributed across the productive territory.
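Because the separation runs almost entirely along average production, the reported breakpoints can be read as an approximate one-dimensional rule. The sketch below is a simplification of the five-variable model; the function name and the handling of values exactly at a breakpoint are assumptions.

```python
SMALL_MEDIUM_BREAK_T = 35_386   # approximate breakpoint reported in the text
MEDIUM_LARGE_BREAK_T = 275_959  # approximate breakpoint reported in the text

def productive_scale(avg_production_t):
    """Approximate scale label from average production (t) alone."""
    if avg_production_t < SMALL_MEDIUM_BREAK_T:
        return "small"
    if avg_production_t <= MEDIUM_LARGE_BREAK_T:
        return "medium"
    return "large"

# Cluster sizes reported in the text and their shares of the 490 pairs.
cluster_sizes = {"small": 459, "medium": 26, "large": 5}
total_pairs = sum(cluster_sizes.values())
shares = {scale: round(100 * n / total_pairs, 1) for scale, n in cluster_sizes.items()}

print(productive_scale(341_163), productive_scale(100_000), productive_scale(577))
print(total_pairs, shares)
```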

Table 8 presents the centroids of the final model for the five clustering variables.

The five pairs classified as large scale are presented in Table 9 and correspond to the main transitory crop production systems of Colombian agriculture: three combinations with papa (Cundinamarca, Boyacá, and Nariño) and two with arroz (Casanare and Tolima), all with presence in the 35 available periods and average production ranging from 341,163 to 709,923 t, indicating sustained production distributed homogeneously throughout the entire historical series. The perfect temporal continuity and the low within-cluster coefficient of variation (30.6%) jointly indicate that these systems are not only the largest in output magnitude but also the most structurally stable, with demand conditions, productive capacity, and institutional support sufficiently consolidated to sustain high output levels across all observed semi-annual periods. This combination of scale and stability positions these five pairs as the strategic core of the Colombian transitory crop sector, warranting differentiated policy instruments focused on competitiveness, risk management, and long-term productivity.

3.2.3. Distribution by Department and Crop

Figure 27 presents the geographic distribution of the 490 department–crop pairs according to their productive scale. The five large-scale pairs are located in five different departments and correspond to two fundamental crops of Colombian agriculture: papa and arroz. Papa dominates in the Andean departments (Cundinamarca, Boyacá, and Nariño), while arroz leads in the rice-producing zones (Tolima and Casanare). This concentration in only two crops reflects the productive structure of the country, where few agricultural systems reach average volumes exceeding 341,000 t per period.

The 26 medium-scale pairs are distributed across 14 departments and span 9 distinct crops (arroz, maíz, papa, tomate, cebolla de bulbo, cebolla de rama, soya, patilla and zanahoria). Meta leads with 4 pairs (arroz, maíz, soya and patilla), followed by Boyacá and Norte de Santander with 3 pairs each; Bolívar, Huila, Antioquia, Córdoba, and Cesar each contribute 2 pairs. Arroz and maíz are the most frequent crops in this category with 7 pairs each, followed by papa (4) and tomate (3); cebolla de bulbo, cebolla de rama, soya, patilla, and zanahoria contribute 1 pair each. The concentration of medium-scale pairs in the Eastern Plains and the northeastern Andean corridor reflects the expansion of technified agriculture in these regions over the study period. The near-perfect mean temporal continuity of this cluster (34.81 periods) indicates that these are not incipient or transitional systems but consolidated productive configurations operating below the output threshold that defines the large scale. This intermediate position makes the medium scale the most policy-relevant tier for targeted interventions aimed at scaling up output, improving market access, and strengthening value chain integration.

The 459 small-scale pairs are distributed across all 32 departments of the country and span all 56 crops in the dataset. Cundinamarca leads with 37 pairs, followed by Cauca (33), Boyacá (30), Nariño (29), Norte de Santander (28), and Santander and Valle del Cauca with 27 pairs each. The most represented crops are frijol (26 pairs), maíz (25), ahuyama (23), tomate (22), patilla (22), and ají (20). The departments in the Orinoquía and Amazon regions present fewer pairs but maintain a presence in this category, reflecting agricultural diversification even in territories with limited market access and infrastructure. The universal territorial coverage of the small scale—the only category present in all 32 departments and across all 56 crops—distinguishes it qualitatively from the medium and large scales: it is not a residual category but the structural foundation of food crop diversity in Colombia. The lower mean temporal continuity of this cluster (30.49 periods, compared to 34.81 and 35.00 for medium and large scale, respectively) reflects greater vulnerability to productive discontinuities, consistent with the exposure of small-scale agriculture to climatic, market, and institutional shocks. These characteristics call for differentiated support strategies oriented toward food security, productive resilience, and rural extension rather than competitiveness-focused instruments.

The spatial analysis confirms that productive scale is not associated with specific geographic regions, but rather with particular department–crop combinations determined by agro-ecological conditions, productive tradition, and installed capacity. Boyacá and Cundinamarca appear in all three scale categories, while departments such as Amazonas, Vaupés, and Guainía participate exclusively in the small scale. Similarly, papa and arroz appear across all three scales and maíz in both the small and medium scales, while specialized crops such as espárrago and albahaca are exclusive to the small scale. This decoupling between geographic location and productive scale is a substantively important finding: it implies that policies designed at the departmental level will systematically conflate very different productive realities within the same administrative unit, and that effective targeting of agricultural support requires operating at the department–crop pair level. The heterogeneity identified here cannot be captured by either departmental aggregates or crop-level national statistics alone, which reinforces the analytical value of the pair-level segmentation proposed in this study.

Table 10 and Table 11 summarize the distribution of pairs by scale between departments and crops, respectively.

3.3. Validation Against Quartile-Based Classification

Before applying clustering, a quartile-based classification of average production was constructed as a reference method, as described in Section 2.5. The quartiles divide the distribution into four equal-size groups (25% each) using the 25th, 50th and 75th percentiles as cut-off points, with values Q_{1} = 139.86 t, Q_{2} = 776.47 t, and Q_{3} = 4056.90 t. Although this approach is computationally simple and widely used in descriptive analyses, it presents important conceptual limitations when the objective is to identify groups with distinctive productive characteristics.

Table 12 presents the cross-tabulation between both classifications. The quartiles generated four groups of nearly identical size: Q_{1} with 123 pairs (25.1%), Q_{2} with 122 pairs (24.9%), Q_{3} with 122 pairs (24.9%), and Q_{4} with 123 pairs (25.1%). In contrast, the clustering identified a markedly asymmetric structure: 459 small-scale pairs (93.7%), 26 medium-scale pairs (5.3%), and only 5 large-scale pairs (1.0%), with natural breakpoints at approximately 35,386 and 275,959 t of average production.

Cross-tabulation reveals that the 5 pairs classified as large scale by the clustering originate exclusively from quartile Q_{4}, as do the 26 medium-scale pairs. However, Q_{4} additionally includes 92 pairs that the clustering classifies as small scale, illustrating the fundamental limitation of quartile-based methods: by forcing a uniform distribution, they place pairs with radically different productive characteristics into the same category. The quartiles Q_{1}, Q_{2}, and Q_{3} contain only small-scale pairs, so the threshold Q_{3} acts as a lower bound for the medium and large scales rather than as a clean boundary between the small scale and the upper categories.
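The counts stated in this section are enough to reconstruct the quartile-by-cluster cross-tabulation and to check that its margins agree with both classifications. The dictionary layout is an illustrative assumption.

```python
# Counts stated in the text for the quartile-by-cluster cross-tabulation.
crosstab = {
    "Q1": {"small": 123, "medium": 0, "large": 0},
    "Q2": {"small": 122, "medium": 0, "large": 0},
    "Q3": {"small": 122, "medium": 0, "large": 0},
    "Q4": {"small": 92, "medium": 26, "large": 5},
}

# Row margins: size of each quartile group.
row_totals = {q: sum(cells.values()) for q, cells in crosstab.items()}

# Column margins: size of each cluster.
column_totals = {
    scale: sum(cells[scale] for cells in crosstab.values())
    for scale in ("small", "medium", "large")
}

share_of_q4_small = crosstab["Q4"]["small"] / row_totals["Q4"]
print(row_totals, column_totals, round(100 * share_of_q4_small, 1))
```

About three quarters of the upper quartile (92 of 123 pairs) belong to the small-scale cluster.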

Quantification of internal homogeneity through the coefficient of variation (CV) confirms this difference. The $Q_{4}$ group presents a CV of 223.9%, with a production range of 3752 to 709,923 t (a difference of 706,171 t), reflecting an extreme dispersion in which the standard deviation is more than twice the mean. In contrast, the large-scale cluster achieves a CV of 30.6%, with pairs whose production ranges from 341,163 to 709,923 t (a difference of 368,759 t), reflecting a considerably more homogeneous group. Small- and medium-scale clusters present CVs consistently below those of their quartile equivalents in the corresponding production range.
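For reference, the CV used in this comparison is the standard deviation expressed as a percentage of the mean. The following sketch shows the calculation on made-up numbers; the use of the population standard deviation is an assumption (the text does not state which estimator was applied), and the two sample groups are hypothetical, not EVA values.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values) -> float:
    """Return the CV in percent: 100 * standard deviation / mean.

    Assumption: population standard deviation (pstdev) is used.
    """
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * pstdev(values) / m


if __name__ == "__main__":
    # Hypothetical groups: one spans a very wide range, the other is compact.
    wide_group = [4_000, 5_000, 6_000, 8_000, 700_000]
    compact_group = [350_000, 400_000, 450_000, 700_000]
    print(f"wide group CV:    {coefficient_of_variation(wide_group):.1f}%")
    print(f"compact group CV: {coefficient_of_variation(compact_group):.1f}%")
```

A CV above 100% means the standard deviation exceeds the mean, which is the situation reported for the top quartile.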

Figure 28 illustrates this difference through a two-panel comparative scatter plot. The left panel shows the segmentation based on quartiles, where pairs are uniformly distributed across categories with cut-off lines that all lie close to the origin, concentrating in $Q_{4}$ both the major producers and the majority of pairs with moderate production. The right panel shows clustering-based segmentation, which recognizes that the vast majority of pairs produce relatively small volumes, an intermediate group presents moderate production, and only five pairs stand out as genuine large-scale producers. From a methodological perspective, the quartile classification presents an inherent limitation: two pairs in the 74.9th and 75.1st percentiles are assigned to different categories despite being practically identical in production, while pairs in the 75.1st and 99.9th percentiles are grouped in the same category $Q_{4}$ despite differing by about two orders of magnitude. Clustering, by contrast, identifies natural discontinuities in the distribution and groups observations according to their actual similarity in the five-variable feature space.

In summary, although quartiles constitute a useful tool for preliminary descriptive analyses, clustering is methodologically superior when the objective is to identify groups with real productive meaning. The clustering-based segmentation not only generates more internally homogeneous groups but also captures the asymmetric structure inherent to Colombian agricultural production, where a small number of department–crop combinations account for the largest share of the sector’s total output.

4. Discussion

4.1. The EDA as an Interpretive Foundation for Clustering

The results of the exploratory data analysis are not independent of the clustering results; instead, they anticipate and explain them. The extreme asymmetry of the productive distributions (skewness greater than 10 for planted area, harvested area, and production, with a mean-to-median ratio of approximately 19 for production) indicated from the outset that the productive space of Colombian transitory crops does not lend itself to segmentation based on uniform distributions. The fact that 91% of records fall below 50,000 t, and that only five pairs exceed 341,000 t of average production, defines a system in which heterogeneity is not noise but structure. The K-Means solution with $k = 3$ formally captures this structure, identifying natural breakpoints at approximately 35,386 t and 275,959 t that emerge from the data itself rather than from arbitrary analytical decisions. The resulting distribution (93.7% small scale, 5.3% medium scale, and 1.0% large scale) is consistent with what the EDA already suggested: a system dominated by dispersed small producers, with a small number of high-scale productive systems concentrating a disproportionate share of national output.
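The two asymmetry indicators mentioned above, skewness and the mean-to-median ratio, can be illustrated with a short sketch. The moment-based (population) skewness shown here is an assumption, since the text does not specify the estimator used, and the sample values are hypothetical.

```python
from statistics import mean, median, pstdev


def skewness(values) -> float:
    """Moment-based skewness: the mean of the cubed z-scores.

    Assumption: population form, without small-sample correction.
    """
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def mean_to_median_ratio(values) -> float:
    """Ratio well above 1 indicates a long right tail."""
    return mean(values) / median(values)


if __name__ == "__main__":
    # Hypothetical right-skewed sample: many small values, one very large one.
    sample = [1, 1, 1, 1, 100]
    print(f"skewness:             {skewness(sample):.2f}")
    print(f"mean-to-median ratio: {mean_to_median_ratio(sample):.1f}")
```

In a strongly right-skewed sample, equal-frequency cut-offs such as quartiles fall among the many small values, which is why they separate the upper tail poorly.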

The spatial concentration measured by the HHI complements this interpretation, although interpreting it correctly requires distinguishing spatial concentration from productive volume. Crops with spatial HHI close to 1 (achicoria, anís, amaranto, alcachofa, champiñón, and romero, among others) have their production virtually confined to one or two departments; these are mostly niche crops or highly localized categories that appear exclusively on a small scale. At the opposite extreme, low-HHI crops (yuca, ahuyama, maíz, frijol, tomate, arroz and cilantro) have a wide territorial distribution, with the cumulative share of the three leading departments ranging between 37% and 70%, indicating that although there are dominant producing regions, a significant fraction of production is distributed across other departments. The case of arroz clearly illustrates that the two dimensions are independent: Casanare and Tolima account for the two large-scale pairs of this crop, yet its low HHI reflects that national rice production extends to Meta, Huila, Norte de Santander and other departments at medium and small scales. This independence between volume and spatial concentration supports the decision to include the HHI as a clustering variable alongside the productive metrics, as it contributes structural information that area and production measures alone do not capture.
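To make the concentration measure concrete, the sketch below computes an HHI on shares (the sum of squared shares, so that a value of 1 means production is confined to a single unit) together with the cumulative share of the leading units. The 0 to 1 scaling matches the "close to 1" wording above; the function names and all input figures are hypothetical.

```python
def hhi(quantities) -> float:
    """Herfindahl-Hirschman index on shares: sum of squared shares.

    Ranges from 1/n (production spread evenly over n units) to 1 (one unit).
    """
    total = sum(quantities)
    if total <= 0:
        raise ValueError("total production must be positive")
    return sum((q / total) ** 2 for q in quantities)


def top_share(quantities, n: int = 3) -> float:
    """Cumulative share of the n largest units."""
    total = sum(quantities)
    if total <= 0:
        raise ValueError("total production must be positive")
    return sum(sorted(quantities, reverse=True)[:n]) / total


if __name__ == "__main__":
    # Hypothetical production by department for two crops (t).
    niche_crop = [980.0, 20.0]
    widespread_crop = [300.0, 250.0, 150.0, 100.0, 100.0, 50.0, 50.0]
    print(f"niche crop HHI:       {hhi(niche_crop):.3f}")
    print(f"widespread crop HHI:  {hhi(widespread_crop):.3f}")
    print(f"widespread top-3:     {top_share(widespread_crop):.0%}")
```

Applying the same function to production by period instead of by department gives a temporal counterpart, under the same assumptions.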

4.2. Large Scale: Structural Stability and Productive Dominance

The five large-scale pairs (papa in Cundinamarca, Boyacá, and Nariño, and arroz in Casanare and Tolima) exhibit three characteristics that set them apart qualitatively from the rest of the system. First, perfect continuity: all five pairs are present in the 35 available periods, indicating mature productive systems with stable demand and consolidated installed capacity. Second, notable internal homogeneity: the coefficient of variation of the large-scale cluster is 30.6%, compared to 223.9% for the quartile $Q_{4}$, confirming that these pairs are genuinely comparable in terms of scale. Third, their cumulative production represents a dominant share of the sectoral total, with Cundinamarca contributing 16.0% of national transitory crop production and Boyacá 11.7%, while Casanare and Tolima concentrate the main volumes of national rice production.

These results are consistent with the literature on staple crops in national food systems [45,46], where certain department–crop complexes acquire a structural role in food supply that transcends short-term market dynamics. Andean potatoes and rice from the Eastern Plains constitute the pillars of basic food supply in Colombia, with differentiated value chains, consolidated sectoral organization, and presence in specific policy instruments. The segmentation obtained provides a quantitative empirical basis for this characterization, which until now relied primarily on qualitative analyses or aggregate statistics at the national level. Moreover, the identification of papa and arroz as exclusive components of the large scale reinforces the argument put forward by Munialo et al. [46] regarding the concentration of research and institutional attention on a small number of staple crops, to the detriment of the broader productive diversity documented in the small-scale segment.

Formal external validation against independent quantitative indicators, such as agricultural GDP contribution or export intensity, is not feasible at the department–crop pair level because such data are not available at the required granularity in Colombia. Nevertheless, the clusters exhibit qualitative consistency with sectoral knowledge. The five large-scale pairs correspond precisely to the productive systems with the highest degree of institutional development and producer-association (gremial) organization in the Colombian agricultural sector: papa is represented by Fedepapa and arroz by Fedearroz, both of which are among the most consolidated sectoral associations in the country, with presence in policy instruments such as price stabilization funds, ICR-based credit programs, and agricultural insurance schemes. This institutional recognition is independent of and predates the present analysis, providing qualitative external evidence that the large-scale cluster identifies systems that are genuinely structurally distinct and not merely statistically cohesive. Additionally, the validation against the quartile-based classification presented in Section 3.3 constitutes a form of methodological external validation: demonstrating that the clustering solution outperforms the standard reference method (CV of 30.6% vs. 223.9% for the upper segment) substantiates that the clusters capture real productive structure rather than statistical artifacts.

4.3. Medium Scale: Transitional Productive Corridors

The 26 medium-scale pairs constitute the analytically richest segment of the system, as they represent the transition zone between subsistence agriculture and large-scale productive systems. The only three pairs with negative silhouette coefficients (all belonging to this cluster) indicate that the boundary between medium and small scale is diffuse, which is consistent with the literature on productive transitions in family agriculture [47,48,49]: there is no sharp threshold between the two categories, but rather a continuum where some pairs are in a process of upscaling or contraction.
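To clarify what a negative silhouette coefficient means here, the sketch below computes the per-point silhouette $s(i) = (b - a) / \max(a, b)$, where $a$ is the mean distance to the other members of the point's own cluster and $b$ is the smallest mean distance to the members of any other cluster. It is a plain-Python illustration on hypothetical one-dimensional data, not the code used in the study, and it assumes Euclidean distance and at least two clusters.

```python
from math import dist


def silhouette_values(points, labels):
    """Return the silhouette coefficient of every point (Euclidean distance).

    Assumptions: at least two clusters; singleton clusters score 0.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance of the point to itself is 0, so it drops out).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


if __name__ == "__main__":
    # Hypothetical 1-D data: the last point sits between the two groups.
    points = [(0.0,), (1.0,), (10.0,), (11.0,), (6.0,)]
    labels = [0, 0, 1, 1, 0]
    for p, s in zip(points, silhouette_values(points, labels)):
        print(p, round(s, 3))
```

A point receives a negative value when it is, on average, closer to another cluster than to its own, which is the situation described for the three borderline medium-scale pairs. The global coefficient is the mean of the per-point values.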

Arroz and maíz lead this category with seven pairs each, and both crops also appear on the large and small scales, confirming that it is not the crop but the department–crop combination that constitutes the relevant unit of analysis. Meta emerges as the leading medium-scale department with four pairs (arroz, maíz, soya and patilla), followed by Boyacá and Norte de Santander with three pairs each, suggesting that the Eastern Plains and the northeastern Andean corridor form the main medium-scale productive zones of the country. This geography is consistent with the expansion of the technified agricultural frontier in Meta over the past two decades and with the tradition of rice and maize cultivation in the Norte de Santander–Cesar–Córdoba corridor. The mean continuity of 34.81 periods (practically identical to the large scale) indicates that these systems are stable over time despite their lower volume, which distinguishes them from the small scale (30.49 periods on average) and reinforces their characterization as consolidated productive systems with the capacity to respond to market cycles. This stability is particularly relevant in the Colombian context, where family and smallholder farming constitutes the backbone of the national food system [48,49], and where the medium scale can represent a strategic tier for targeted productivity and market access interventions.

4.4. Small Scale: Territorial Diversity as a Structural Feature

That 93.7% of pairs are classified as small scale should not be interpreted as a statistical residual but as a defining characteristic of the Colombian agricultural system. The distribution of these 459 pairs across all 32 departments of the country and all 56 crops in the dataset reveals that the small scale is the foundation of territorial food diversity, a role that has been consistently documented for family agriculture in Latin America [47,48]. Crops such as frijol (26 pairs), ahuyama (23), ají (20), habichuela (18) and pepino cohombro (18) are exclusive or nearly exclusive to this category, and their presence in multiple departments constitutes a network of local food production that is not visible in national aggregates.

The mean continuity of 30.49 periods, lower than that of the higher categories, indicates greater volatility in entry and exit from the productive registry, consistent with the vulnerability of small-scale agriculture to climatic, economic and policy shocks [49]. Departments such as Amazonas, Vaupés, and Guainía participate exclusively in this category, showing that in the Amazonian and Orinocan periphery, agricultural activity recorded in the EVA responds to subsistence and local diversification dynamics rather than to regional market forces. Cundinamarca leads with 37 small-scale pairs, demonstrating that even the most productive department in the country (which also hosts one of the five large-scale pairs, papa) maintains a broad base of diversified small-scale production, a result consistent with its high crop diversity (66 crops registered, 68.8% of the total dataset).

4.5. Clustering Versus Quartile Classification: A Methodological Argument

The quartile $Q_{4}$ groups 123 pairs with an average production between 3752 and 709,923 t, a range of 706,171 t and a CV of 223.9%. The large-scale cluster groups 5 pairs with production between 341,163 and 709,923 t and a CV of 30.6%. The difference is qualitative: the quartile-based method places radically different productive systems in the same category, while clustering identifies genuinely homogeneous groups. This limitation of percentile-based classifications has been reported in other domains involving asymmetric distributions, where data-driven segmentation consistently outperforms fixed-threshold approaches in terms of within-group homogeneity [50,51].

The practical implication is straightforward: if quartiles were used to target large-scale agricultural support programs, 92 pairs that the clustering classifies as small scale would be included, with the risk of allocating competitiveness or technology transfer instruments to systems that require food security or rural extension interventions [47]. Nurkhofifah et al. [51] reached analogous conclusions in the context of educational disparity classification, showing that K-Means segmentation reveals multidimensional patterns that quartile grouping systematically obscures. This result suggests that the adoption of unsupervised grouping methods should be considered standard practice in sectoral productive characterization studies when the variables of interest exhibit markedly asymmetric distributions.

4.6. Replicability and Broader Methodological Applicability

The pipeline developed in this study (structured preprocessing, temporal continuity threshold, feature engineering with HHI, and K-Means segmentation with silhouette validation) is directly replicable in any national agricultural statistics system that records area, production, and yield at the subnational level with semi-annual or annual temporal resolution. Analogous cases include the Pesquisa Agrícola Municipal (PAM) of IBGE in Brazil at the municipal level [48], the district-level data of ICRISAT in India, the state-level records of SIAP in Mexico, and the regional statistics of EUROSTAT for the countries of the European Union. The only operational condition is the availability of sufficiently long time series; the threshold of 15 periods proposed here can be adjusted according to the temporal granularity of each source.
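The temporal continuity threshold mentioned above can be sketched as a simple filter. In this illustration a period counts as active when it has strictly positive recorded production, the threshold is treated as inclusive, and period keys such as `"2007A"` are a made-up format; all three are assumptions, as are the function names and the sample series.

```python
MIN_ACTIVE_PERIODS = 15  # continuity threshold stated in the text


def active_periods(production_by_period: dict) -> int:
    """Count periods with strictly positive recorded production."""
    return sum(
        1 for value in production_by_period.values()
        if value is not None and value > 0
    )


def passes_continuity(production_by_period: dict,
                      threshold: int = MIN_ACTIVE_PERIODS) -> bool:
    """Keep a department-crop pair only if it is active in enough periods.

    Assumption: the threshold is inclusive (>=).
    """
    return active_periods(production_by_period) >= threshold


if __name__ == "__main__":
    # Hypothetical semi-annual series over 10 years (20 periods).
    periods = [f"{year}{semester}" for year in range(2007, 2017) for semester in "AB"]
    steady_pair = {p: (100.0 if i < 16 else 0.0) for i, p in enumerate(periods)}
    sporadic_pair = {p: (50.0 if i < 10 else None) for i, p in enumerate(periods)}
    print(active_periods(steady_pair), passes_continuity(steady_pair))
    print(active_periods(sporadic_pair), passes_continuity(sporadic_pair))
```

For a source with annual rather than semi-annual records, the `threshold` argument would be lowered accordingly.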

The use of HHI as a clustering feature to simultaneously capture spatial concentration and temporal stability of production is not common in the agricultural clustering literature; while the individual methods employed are established, their combination in this application context provides a practical demonstration of how concentration indices can enrich productive characterization beyond volume-based variables alone.
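As an illustration of the segmentation step, the following sketch standardizes a feature matrix and runs a plain Lloyd-style K-Means with $k = 3$. It is not the implementation used in the study: the z-score standardization, the deterministic initialization, and the two-feature toy data are all assumptions chosen to keep the example short and reproducible with the Python standard library.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so that no single feature dominates the distance."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant columns map to 0
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm; returns (labels, centroids).

    Assumption: initial centroids are k points spread evenly over the
    lexicographically sorted data, which makes the run deterministic.
    """
    ordered = sorted(points)
    step = (len(ordered) - 1) / (k - 1) if k > 1 else 0
    centroids = [ordered[round(i * step)] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(mean(col) for col in zip(*members))
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical (average production, active periods) rows, not EVA data.
    rows = [(1, 1), (2, 1), (1, 2), (50, 10), (52, 11), (51, 9), (300, 35), (310, 34)]
    labels, _ = k_means(standardize(rows), k=3)
    print(labels)
```

In practice the same two steps would be applied to all five clustering features, and the resulting labels would then be checked with the silhouette coefficient.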

5. Conclusions

This study produced a statistically robust and semantically interpretable productive scale segmentation of 490 department–crop pairs of Colombian transitory crops (2007–2024), validated by a global silhouette coefficient of 0.888, revealing a markedly asymmetric structure: 93.7% small scale (459 pairs, 56 crops, 32 departments), 5.3% medium scale (26 pairs, 9 crops, 14 departments), and 1.0% large scale (5 pairs, 2 crops: papa and arroz, 5 departments), with natural breakpoints at ≈35,386 t and ≈275,959 t. The quantitative superiority of clustering over quartile-based classification was demonstrated by the reduction in the within-group CV from 223.9% ($Q_{4}$) to 30.6% (large-scale cluster). Together with the descriptive characterization of the full EVA dataset, these outputs constitute the first empirical evidence base for productive scale differentiation in the Colombian transitory crop sector, supported by a replicable pipeline and an open dataset published under CC BY 4.0 [44].

The segmentation has direct implications for the differentiated design of agricultural policy instruments in Colombia. The five large-scale pairs, with average production between 341,163 and 709,923 t and perfect continuity over 35 periods, represent structurally stable systems that anchor national food supply and are candidates for instruments such as price support mechanisms, export promotion programs, and sectoral risk management schemes, all of which presuppose the productive maturity and institutional capacity that this cluster demonstrably possesses. The 26 medium-scale pairs, concentrated in Meta, Boyacá, and Norte de Santander, constitute a strategic tier where targeted investments in technology transfer, rural credit access, and market linkage programs are most likely to generate measurable scaling effects, given their demonstrated productive stability and intermediate output levels. The 459 small-scale pairs, distributed throughout the national territory and serving as the foundation of regional food diversity, require approaches specifically oriented toward food security, climate resilience, and rural extension rather than competitiveness-focused instruments, given their greater productive volatility and their structural role in sustaining local dietary diversity rather than national commodity supply.

Two main limitations should be considered when interpreting the results. Departmental aggregation, while consistent with the objective of the study, conceals the internal productive heterogeneity of each department, where municipalities with very different dynamics may coexist within the same pair. Additionally, the clustering variables are strictly productive in nature and do not incorporate socioeconomic dimensions such as access to credit, the level of mechanization, or value-chain integration, which could enrich the characterization of each scale.

Regarding the preprocessing stage, the deterministic imputation applied to zero-value and inconsistent records assumes that the predominant mechanism is measurement error rather than structural absence of production. While this assumption is supported by the agronomic nature of the affected cases and the empirically verified identity relationships used for imputation, it cannot be formally tested. The imputed volume is small (fewer than 1.4% of records across all variables), which limits the potential impact on clustering outcomes; however, in departments with historically weaker reporting systems, residual systematic missingness cannot be fully ruled out.

The static nature of this analysis is a deliberate methodological choice consistent with the scope of this first article in the series: establishing a structurally grounded productive scale characterization over the full 2007–2024 horizon requires collapsing temporal dynamics into summary statistics to produce stable, interpretable cluster profiles. Extending the analysis to dynamic frameworks (such as time-sliced clustering, trajectory-based classification, or sub-period stability testing) would substantially increase the scope and length of the manuscript beyond what is appropriate for a single article, and would conflate two analytically distinct objectives.

Three future research directions follow directly from this work. In the short term, the second article in this series addresses precisely these dynamic questions and will apply multidimensional temporal clustering to the same 490 pairs, incorporating trend, volatility, seasonality, and continuity of the series as variables, using the scale segmentation obtained here as a stratification criterion. That second article will directly address the temporal robustness of the productive structures identified here and assess whether clusters remain stable across sub-periods. In the medium term, large- and medium-scale pairs, with complete series and high continuity, are direct candidates for production forecasting models (SARIMA, Prophet, or recurrent neural networks), whose construction will be facilitated by the structural characterization developed in this work. In the long term, replication of the pipeline in other national agricultural statistics systems will enable international comparisons of subnational productive structure and allow evaluation of the generalizability of the patterns identified for Colombia.

Author Contributions

Conceptualization, N.D.M. and S.-C.V.-A.; methodology, N.D.M.; software, N.D.M.; validation, N.D.M., J.B.-V. and S.-C.V.-A.; formal analysis, S.-C.V.-A.; investigation, N.D.M.; resources, N.D.M.; data curation, N.D.M.; writing—original draft preparation, N.D.M., J.B.-V. and S.-C.V.-A.; writing—review and editing, N.D.M. and S.-C.V.-A.; visualization, N.D.M.; supervision, J.B.-V. and S.-C.V.-A.; project administration, J.B.-V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the Zenodo repository at https://doi.org/10.5281/zenodo.19021281. These data were derived from the following resources available in the public domain: Evaluaciones Agropecuarias Municipales (EVA) 2007–2018 and 2019–2024, published by the Colombian Ministry of Agriculture and Rural Development (MADR) at https://www.datos.gov.co.

Acknowledgments

During the preparation of this manuscript, the authors used Claude Sonnet 4.5 and 4.6 (Anthropic, https://claude.ai, accessed 1 January 2025–13 April 2026) for the purposes of academic writing assistance and language refinement in English, LaTeX document formatting, literature organization, and review of internal consistency across sections. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

General abbreviations
CC BY 4.0	Creative Commons Attribution 4.0 International License
CRISP-DM	Cross-Industry Standard Process for Data Mining
CV	Coefficient of Variation
EVA	Evaluaciones Agropecuarias Municipales
EUROSTAT	Statistical Office of the European Union
HHI	Herfindahl–Hirschman Index
IBGE	Instituto Brasileiro de Geografia e Estatística
ICRISAT	International Crops Research Institute for the Semi-Arid Tropics
MADR	Colombian Ministry of Agriculture and Rural Development (Ministerio de Agricultura y Desarrollo Rural)
PAM	Pesquisa Agrícola Municipal
PCA	Principal Component Analysis
SDG	Sustainable Development Goal
SIAP	Servicio de Información Agroalimentaria y Pesquera
UAV	Unmanned Aerial Vehicle
WCSS	Within-Cluster Sum of Squares
Variable names (article) and EVA source counterparts
sown_area (alt. planted_area)	area_sembrada in EVA source and dataset
harvested_area	area_cosechada in EVA source and dataset
production	produccion in EVA source and dataset
yield	rendimiento in EVA source and dataset
department	departamento in EVA source and dataset
crop	cultivo in EVA source and dataset
year	anio in EVA source and dataset
period	periodo in EVA source and dataset
crop_cycle	ciclo_cultivo in EVA source and dataset
Crop names: Spanish (EVA source and text)—English equivalent
achicoria	Chicory
ahuyama	Pumpkin/squash
ají	Chili pepper
albahaca	Basil
alcachofa	Artichoke
amaranto	Amaranth
anís	Anise
arroz	Rice
arveja	Pea
caléndula	Marigold
cebolla de bulbo	Bulb onion
cebolla de rama	Spring onion/green onion
champiñón	Mushroom
cilantro	Coriander/cilantro
espárrago	Asparagus
frijol	Common bean
habichuela	Green bean/French bean
lechuga	Lettuce
maíz	Corn/maize
melón	Melon
papa	Potato
patilla	Watermelon
pepino cohombro	Cucumber
pimentón	Bell pepper
romero	Rosemary
sorgo	Sorghum
soya	Soybean
tomate	Tomato
yuca	Cassava
zanahoria	Carrot

References

Han, S.Z.; Pan, W.T.; Zhou, Y.Y.; Liu, Z.L. Construct the prediction model for China agricultural output value based on the optimization neural network of fruit fly optimization algorithm. Future Gener. Comput. Syst. 2018, 86, 663–669.
Ceballos-Freire, A.J.; Andrés, M.D.; Benavides-Martínez, I.F.; Tobar, C.J. Characterization of land use, from a social, economic and environmental dynamics. Rev. Cienc. Agríc. 2024, 41, e1228.
Ministerio de Agricultura y Desarrollo Rural. El Sector Agricultura, Protagonista en 2024 de la Reactivación Económica del País; Ministerio de Agricultura y Desarrollo Rural: Bogotá, Colombia, 2024.
Departamento Administrativo Nacional de Estadística (DANE). Encuesta Nacional Agropecuaria ENA 2023; Technical Report; DANE: Bogotá, Colombia, 2023.
OECD; FAO. OCDE-FAO Perspectivas Agrícolas 2020–2029; OECD Publishing: Paris, France, 2020.
Comisión Económica para América Latina y el Caribe (CEPAL). La Transformación Rural Sostenible en América Latina y el Caribe; Technical Report; Naciones Unidas: Santiago, Chile, 2022.
Araújo, S.O.; Peres, R.S.; Ramalho, J.C.; Lidon, F.; Barata, J. Machine Learning Applications in Agriculture: Current Trends, Challenges, and Future Perspectives. Agronomy 2023, 13, 2976.
Kuan, C.H.; Leu, Y.; Lin, W.S.; Lee, C.P. The Estimation of the Long-Term Agricultural Output with a Robust Machine Learning Prediction Model. Agriculture 2022, 12, 1075.
Jabed, M.A.; Azmi Murad, M.A. Crop yield prediction in agriculture: A comprehensive review of machine learning and deep learning approaches, with insights for future research and sustainability. Heliyon 2024, 10, e40836.
van Klompenburg, T.; Kassahun, A.; Catal, C. Crop yield prediction using machine learning: A systematic literature review. Comput. Electron. Agric. 2020, 177, 105709.
Angelidis, G. A Mathematical Framework for Modeling Global Value Chain Networks. Foundations 2026, 6, 8.
Farismana, R. PENERAPAN K-MEANS CLUSTERING UNTUK PEMETAAN PRODUKTIVITAS PADI DAN PREDIKSI PANEN DI KABUPATEN INDRAMAYU. J. Inf. Syst. Appl. Manag. Account. Res. 2024, 8, 589–605.
Kara, Z.; Aybar, O.; Yucel, M.; Ustundag, B.B. Clustering the Climate: A Machine Learning Approach to Microclimate Zoning and Crop Suitability. In 2025 13th International Conference on Agro-Geoinformatics, Agro-Geoinformatics 2025, Boulder, CO, USA, 7–10 July 2025; IEEE: Piscataway, NJ, USA, 2025.
Rivera, A.J.; Pérez-Godoy, M.D.; Elizondo, D.; Deka, L.; del Jesus, M.J. Analysis of clustering methods for crop type mapping using satellite imagery. Neurocomputing 2022, 492, 91–106.
Janrao, P.; Mishra, D.; Bharadi, V. Clustering Approaches for Management Zone Delineation in Precision Agriculture for Small Farms. In Proceedings of the International Conference on Sustainable Computing in Science, Technology and Management (SUSCOM), Jaipur, India, 26–28 February 2019.
Etumnu, C.; Gray, A.W. A Clustering Approach to Understanding Farmers’ Success Strategies. J. Agric. Appl. Econ. 2020, 52, 335–351.
Pascucci, S.; Carfora, M.F.; Palombo, A.; Pignatti, S.; Casa, R.; Pepe, M.; Castaldi, F. A Comparison between Standard and Functional Clustering Methodologies: Application to Agricultural Fields for Yield Pattern Assessment. Remote Sens. 2018, 10, 585.
Lahza, H.; Naveen Kumar, K.R.; Sreenivasa, B.R.; Shawly, T.; Alsheikhy, A.A.; Hiremath, A.K.; Lahza, H.F.M. Optimization of Crop Recommendations Using Novel Machine Learning Techniques. Sustainability 2023, 15, 8836.
Hernández-Salazar, C.A.; González-Estrada, O.A.; González-Silva, G. Integración de la inteligencia artificial y la agricultura de precisión en cultivos de café. Rev. UIS Ing. 2024, 23, 145–158.
Gómez Arango, A.M. Predicción del Rendimiento de Cultivos Agrícolas en los Cinco Corregimientos de la Ciudad de Medellín, Utilizando Modelos de Machine Learning. Ph.D. Thesis, Universidad EAFIT, Medellín, Colombia, 2024.
Sierra Forero, B.L. Modelo para Predicción del Rendimiento de Cultivos de maíz en Colombia Empleando Aprendizaje Profundo. Ph.D. Thesis, Universidad Distrital Francisco José de Caldas, Bogotá, Colombia, 2024.
Matta Monroy, N.J. Aplicación de Machine Learning para la Orientación en la Toma de Decisiones Frente al Uso Agrícola Apropiado del Suelo para Zonas con Cultivos Ilícitos en Colombia. Ph.D. Thesis, Universidad Distrital Francisco José de Caldas, Bogotá, Colombia, 2024.
Niño Chaparro, G.E.; Niño Chaparro, A.; Chaparro Pesca, J.A. IA en Mercados de Alimentos en Colombia: Usando Machine Learning para Enfrentar Crisis de Precios. Econ. CUC 2024, 45, e24818.
Ministerio de Agricultura y Desarrollo Rural. Evaluaciones Agropecuarias Municipales EVA 2007–2018; Ministerio de Agricultura y Desarrollo Rural: Bogotá, Colombia, 2018.
Ministerio de Agricultura y Desarrollo Rural. Evaluaciones Agropecuarias Municipales EVA 2019–2024; Ministerio de Agricultura y Desarrollo Rural: Bogotá, Colombia, 2024.
Food and Agriculture Organization of the United Nations. Sustainable Development Goals and the Role of Agriculture; Technical Report; FAO: Rome, Italy, 2017.
FAO. World Food and Agriculture–Statistical Yearbook 2022; Technical Report; FAO: Rome, Italy, 2022.
Hidalgo, F.; Birkenberg, A.; Daum, T.; Bosch, C.; Quiñones-Ruiz, X.F. How do coffee farmers engage with digital technologies? A capabilities perspective. Agric. Hum. Values 2024, 41, 1707–1723.
Morán-Figueroa, G.H.; Muñoz-Pérez, D.F.; Rivera-Ibarra, J.L.; Cobos-Lozada, C.A. Model for Predicting Maize Crop Yield on Small Farms Using Clusterwise Linear Regression and GRASP. Mathematics 2024, 12, 3356.
Shitov, A.G.; Yushkevich, L.V.; Yushchenko, D.N.; Zagrebelny, V.E.; Kashinskaya, S.P. Cluster analysis of long-term crop yield data in crop rotation. In Proceedings of the BIO Web of Conferences; EDP Sciences, Ed.; EDP Sciences: Les Ulis, France, 2024; Volume 108, p. 08008.
Wirth, R.; Hipp, J. CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining; The Practical Application Company: Woburn, MA, USA, 2000; Volume 1, pp. 29–39.
Yoyram, J.; Janruang, J.; Karnka, S.; Vongpramate, D.; Boonkate, K. Clustering of Tourist Zones and Facilities Using the DBSCAN Algorithm. In 2025 10th International Conference on Business and Industrial Research: Advanced Technology and Innovation for Sustainable Society, ICBIR 2025-Proceedings, Bangkok, Thailand, 22–23 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 306–311.
Dewi, K.C.; Ciptayani, P.I.; Ayuni, N.W.D.; Yudistira, I.B.P.S. Modeling Salesperson Performance Based on Sales Data Clustering. 2022 5th International Seminar on Research of Information Technology and Intelligent Systems, ISRITI 2022, Yogyakarta, Indonesia, 8–9 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 390–396.
Díaz-Pulido, J.A. Aplicación de la minería de datos espacial basado en técnicas de agrupamiento para detectar los niveles de congestionamiento del tráfico vehicular en la ciudad de Trujillo. Sciéndo Ingenium 2024, 20, 11–37.
Espinosa-Zúñiga, J.J. Implementation of the CRISP-DM methodology for geographical segmentation using a public database. Ing. Investig. Tecnol. 2020, 21, 17.
Koukaras, P.; Tjortjis, C. Data Preprocessing and Feature Engineering for Data Mining: Techniques, Tools, and Best Practices. AI 2025, 6, 257.
Wu, Y.; Huang, B.; Li, X.; Zhang, Y.; Xu, X. A Data-Driven Approach to Detect Passenger Flow Anomaly under Station Closure. IEEE Access 2020, 8, 149602–149615.
Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice, 3rd ed.; OTexts: Melbourne, Australia, 2021.
Hirschman, A.O. The paternity of an index. Am. Econ. Rev. 1964, 54, 761–762.
MacQueen, J. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley; University of California Press: Oakland, CA, USA, 1967; Volume 1, pp. 281–297.
Herdiana, I.; Kamal, M.A.; Triyani; Estri, M.N.; Renny. A More Precise Elbow Method for Optimum K-means Clustering. arXiv 2025, arXiv:2502.00851.
Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
Kokoska, S.; Zwillinger, D. CRC Standard Probability and Statistics Tables and Formulae, 1st ed.; CRC Press: Boca Raton, FL, USA, 2000; pp. 1–217.
Muñoz, N.D. Clustering-Based Productive Scale Segmentation of Colombian Transitory Crop Production (2007–2024). Zenodo 2026.
Kreitzman, M.; Toensmeier, E.; Chan, K.M.A.; Smukler, S.; Ramankutty, N. Perennial Staple Crops: Yields, Distribution, and Nutrition in the Global Food System. Front. Sustain. Food Syst. 2020, 4, 588988.
Munialo, S.; Siddique, K.H.; Barker, N.P.; Onyango, C.M.; Amissah, J.N.; Wamalwa, L.N.; Qwabe, Q.; Dougill, A.J.; Sibanda, L.M. Reorienting research investments toward under-researched crops for sustainable food systems. Food Energy Secur. 2024, 13, e538.
Piza, C.; Díaz, L.P.; Pulido, N.; Rincón, R.J.D. Agricultura Familiar: Una Alternativa para la Seguridad Alimentaria. Conex. Agropecu. JDC 2016, 6, 13–25.
de Castro, C.N. Agricultura Familiar no Brasil, na América Latina e no Caribe: Institucionalidade, Características e Desafios; Instituto de Pesquisa Econômica Aplicada: Brasília, Brazil, 2024.
Palacios Bucheli, V.J. Family farming in Colombia: A pillar for food security and sustainable development. Agron. Colomb. 2025, 43, e120429. [Google Scholar] [CrossRef]
Vantas, K.; Sidiropoulos, E.; Vafiadis, M. Rainfall Temporal Distribution in Thrace by Means of an Unsupervised Machine Learning Method. In Proceedings of the Protection and Restoration of the Environment XIV, Thessaloniki, Greece, 3–6 July 2018. [Google Scholar]
Nurkhofifah, E.; Athina, D.; Tarida, A.R.; Pratiwi, F.A. Clustering of Junior High School Education in West Java Based on Density and Dropout Ratios Using Quartile and KMeans Methods. In Proceedings of the International Conference on Data Science and Official Statistics; Politeknik Statistika STIS: Jakarta, Indonesia, 2025; Volume 2025, pp. 483–511. [Google Scholar] [CrossRef]

Figure 1. Complete methodological pipeline of the study, with CRISP-DM phases mapped to the corresponding stages.

Figure 2. Data preparation stage scheme, adapted from Koukaras and Tjortjis [36].

Figure 3. Correlation matrix among the nine candidate variables for the productive scale clustering.

Figure 4. Annual distribution of records in the aggregated departmental dataset, 2006–2024.

Figure 5. Univariate distributions of sown_area and harvested_area. (a) sown_area in linear scale. (b) sown_area in logarithmic scale. (c) harvested_area in linear scale. (d) harvested_area in logarithmic scale.

Figure 6. Univariate distributions of production and yield. (a) Production in linear scale. (b) Production in logarithmic scale. (c) Yield in linear scale. (d) Yield in logarithmic scale.

Figure 7. Top 10 department–crop pairs by productive indicator, 2007–2024. (a) Cumulative production. (b) Average planted area. (c) Average harvested area. (d) Average yield.

Figure 8. Temporal continuity distribution of department–crop pairs. (a) Observed conditions (36 periods, 2006B–2024B, excluding 2018B). (b) Theoretical scenario with imputation of 2018B (35 periods, 2007A–2024B).

Figure 9. Frequency of transitory crops by number of department–crop pairs. (a) Ten most frequent crops. (b) Ten least frequent crops.

Figure 10. Twenty transitory crops with the highest total cumulative production, 2007–2024.

Figure 11. Spatial coverage of transitory crops, 2007–2024. (a) Top 20 crops with the greatest spatial coverage (number of producing departments). (b) Distribution of all crops by number of producing departments. The red dotted vertical line indicates the maximum number of producing departments (32).

Figure 12. Spatial concentration of transitory crops measured by the Herfindahl–Hirschman Index (HHI), 2007–2024. (a) Top 10 most spatially concentrated crops (highest HHI). (b) Top 10 least spatially concentrated crops (lowest HHI).

Figure 13. Twenty departments with the highest total cumulative transitory crop production, 2007–2024.

Figure 14. Crop diversity by department. (a) Ten departments with the highest number of distinct transitory crops. (b) Ten departments with the lowest crop diversity.

Figure 15. Department–crop presence/absence coverage matrix (32 departments × 96 crops; overall density = 31.5%). (a) Crops with wider spatial coverage (greater number of producing departments). (b) Crops with narrower spatial coverage (fewer producing departments).

Figure 16. Colombian map showing transitory crop diversity by department, 2007–2024.

Figure 17. National total transitory crop production, 2007–2024. (a) Annual production (millions of tonnes) with linear trend. (b) Year-on-year percentage variation in production. Green bars indicate positive year-on-year variation and red bars indicate negative variation.

Figure 18. National total planted area of transitory crops, 2007–2024. (a) Annual planted area (millions of hectares) with linear trend. (b) Year-on-year percentage variation in planted area. Green bars indicate positive year-on-year variation and red bars indicate negative variation.

Figure 19. Number of active department–crop pairs, 2007–2024. (a) Annual count of active pairs with linear trend. The shaded blue area represents the range of active pairs. (b) Year-on-year percentage variation in active pairs. Green bars indicate positive variation and red bars indicate negative variation.

Figure 20. Distribution of productive variables by semi-annual period, 2007–2024. (a) Sown area (ha). (b) Harvested area (ha). (c) Production (t). (d) Yield (t/ha). Box plots show median, interquartile range, and mean (red circle).

Figure 21. Cumulative totals of productive variables by semi-annual period, 2007–2024. (a) Total sown area (ha). (b) Total harvested area (ha). (c) Total production (t). (d) Mean yield (t/ha).

Figure 22. Pearson correlation matrix between productive variables (sown_area, harvested_area, production, and yield).

Figure 23. Residual analysis and linear regression between harvested_area and production. (a) Residuals versus fitted values. (b) Residual distribution. (c) Scatter plot with fitted regression line ($y = 1249.71 + 6.03x$; $R^{2} = 0.595$). The red dotted line in (a) indicates the zero-residual reference; in (b), it marks the zero value.

Figure 24. Residual analysis and linear regression between sown_area and harvested_area. (a) Residuals versus fitted values. (b) Residual distribution. (c) Scatter plot with fitted regression line ($y = 105.61 + 0.868x$; $R^{2} = 0.829$). The red dotted line in (a) indicates the zero-residual reference; in (b), it marks the zero value.

Figure 25. Determination of the optimal number of clusters. (a) Elbow method (within-cluster sum of squares, WCSS). (b) Mean silhouette coefficient for $k = 2, \dots, 6$.

Figure 26. Model validation for $k = 3$ (mean silhouette = 0.888). (a) Silhouette plot by productive scale cluster (Small-scale $n = 459$; Medium-scale $n = 26$; Large-scale $n = 5$). (b) Two-dimensional PCA projection with clusters identified by productive scale; PC1 concentrates 99.3% of explained variance.

Figure 27. Geographic distribution of the 490 department–crop pairs by productive scale, 2007–2024. (a) Small scale (459 pairs, 32 departments). (b) Medium scale (26 pairs, 14 departments). (c) Large scale (5 pairs, 5 departments).

Figure 28. Comparative visualization of productive scale segmentation. (a) Quartile-based classification (fixed cut-offs at the 25th, 50th, and 75th percentiles of average production; $Q_{1}$: $n = 123$; $Q_{2}$: $n = 122$; $Q_{3}$: $n = 122$; $Q_{4}$: $n = 123$). (b) K-Means clustering-based classification ($k = 3$, boundaries at natural discontinuities; Small-scale: $n = 459$; Medium-scale: $n = 26$; Large-scale: $n = 5$). In (b), the green dashed vertical lines indicate the natural breakpoints at approximately 35,386 t and 275,959 t.

Table 1. Characteristics of the two EVA source datasets prior to processing.

Characteristic	EVA 2007–2018	EVA 2019–2024
Records	206,068	141,073
Variables	17	18
Temporal coverage	2007–2018	2019–2024
Source	[24]	[25]

Table 2. Consolidated imputation decision table for zero-value cases, organized by group.

Case	Records	Agronomic Interpretation	Decision and Justification
A1	3720	Failed sowing or total crop loss: sown_area > 0 but no harvest or production. Agronomically coherent.	Retain as valid. Contributes sown_area to aggregates without inflating production.
A2	221	Same as A1 but yield recorded as NaN (undefined 0/0 operation).	Set yield = 0. Homogenizes with A1 and eliminates NaN in a variable whose value is known.
A3	24	Failed sowing with positive yield: no harvest or production yet yield is reported. Likely a data entry error.	Correct yield = 0. Adjusting the derived variable is more conservative than imputing harvest.
A4	104	No area or production, yet yield > 0. Positive yield is incompatible with total absence of productive activity.	Correct yield = 0. Prevents spurious values from distorting summaries and models.
B	112	Positive sown_area, harvested_area, and yield, but production = 0. Production must exist; zero indicates an unreported value.	Impute $\text{production} = \text{harvested\_area} \times \text{yield}$. Restores the agronomic identity and prevents underestimation of production.
C2	436	Positive production and yield, but both area variables are zero; areas were not entered.	Impute $\text{harvested\_area} = \text{production} / \text{yield}$, then $\text{sown\_area} = \text{harvested\_area}$. Exploits the near-proportional relationship between both area variables.
C3	96	Positive sown_area, production, and yield, but harvested_area = 0. Inconsistent: harvested area must be positive if production is reported.	Impute $\text{harvested\_area} = \text{production} / \text{yield}$, retaining sown_area. Restores the agronomic identity.
C5	2278	Positive harvested_area and production, but sown_area = 0. Planted area not recorded for the period.	Impute $\text{sown\_area} = \text{harvested\_area}$. Planted and harvested areas are strongly aligned in transitory crops; total losses are captured in A1–A2.

Cases with zero records (A0, C1, C4) are omitted from this table.
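The decisions in Table 2 can be read as a small rule set applied record by record. The following is a minimal sketch under stated assumptions, not the authors' code: the function name `apply_zero_value_rules`, the dictionary layout, and the key `yield_` are hypothetical, and the rules are ordered so that the A cases are checked first.

```python
import math


def _positive(x):
    """True when x is a real number strictly greater than zero (NaN and None count as not positive)."""
    return x is not None and not math.isnan(x) and x > 0


def apply_zero_value_rules(rec):
    """Return a corrected copy of one record, following the Table 2 decisions.

    `rec` is a hypothetical dict with the keys sown_area, harvested_area,
    production and yield_ (t/ha). Records matching no case are returned unchanged.
    """
    r = dict(rec)
    sown, harv, prod, yld = r["sown_area"], r["harvested_area"], r["production"], r["yield_"]

    if harv == 0 and prod == 0:
        # Cases A1-A4: nothing harvested or produced, so yield is set to 0
        # (this also replaces a NaN yield, as in case A2).
        r["yield_"] = 0.0
    elif prod == 0 and _positive(sown) and _positive(harv) and _positive(yld):
        # Case B: production = harvested_area * yield.
        r["production"] = harv * yld
    elif _positive(prod) and _positive(yld) and harv == 0:
        # Cases C2 and C3: harvested_area = production / yield.
        r["harvested_area"] = prod / yld
        if sown == 0:
            # Case C2 only: sown_area is copied from the imputed harvested_area.
            r["sown_area"] = r["harvested_area"]
    elif _positive(prod) and _positive(harv) and sown == 0:
        # Case C5: sown_area = harvested_area.
        r["sown_area"] = harv
    return r


case_b = apply_zero_value_rules(
    {"sown_area": 10.0, "harvested_area": 8.0, "production": 0.0, "yield_": 5.0}
)
print(case_b["production"])  # 40.0
```

Keeping the A branch first means a record with no harvest and no production never reaches the imputation branches, which matches the table's choice to correct the derived variable rather than impute a harvest.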

Table 3. Sensitivity analysis of the temporal continuity threshold: pairs retained and production coverage by candidate threshold value.

Threshold (Periods)	Approx. Years	Pairs Retained	Pairs (%)	Production Coverage (%)
10	5.0	688	71.0	99.91
12	6.0	648	66.9	99.68
15	7.5	513	52.9	90.65
18	9.0	477	49.2	90.46
20	10.0	458	47.3	90.42
25	12.5	397	41.0	89.60
30	15.0	350	36.1	88.94

The selected threshold is 15 periods (highlighted in the original table). Above 15 periods, production coverage stabilizes around 90% while pair coverage declines steadily, which supports 15 periods as the trade-off point.
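The "Pairs (%)" column of Table 3 can be reproduced from the retained counts, assuming the denominator is the 969 department–crop pairs reported for the aggregated dataset in Tables 4 and 5. The short sketch below only re-derives figures already printed in the table; the variable names are illustrative.

```python
TOTAL_PAIRS = 969  # assumption: pairs in the aggregated dataset (Tables 4 and 5)

# Threshold (periods) -> pairs retained, copied from Table 3.
retained = {10: 688, 12: 648, 15: 513, 18: 477, 20: 458, 25: 397, 30: 350}


def pair_share(n_retained, total=TOTAL_PAIRS):
    """Percentage of pairs retained, rounded to one decimal as in Table 3."""
    return round(100 * n_retained / total, 1)


shares = {threshold: pair_share(n) for threshold, n in retained.items()}

for threshold, share in shares.items():
    # Two semi-annual periods make one year, hence threshold / 2.
    print(f"{threshold:>2} periods (~{threshold / 2:.1f} years): {share:.1f}% of pairs")
```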

Table 4. Global descriptive statistics of the aggregated dataset 2007–2024 (18,716 observations; 969 department–crop pairs).

Variable	Min	Q1	Median	Mean	Q3	Max	SD
sown_area (ha)	0.01	23.00	86.00	1738.56	450.00	207,751.00	6852.42
harvested_area (ha)	0.00	20.00	78.00	1055.67	400.00	186,971.00	4234.81
production (t)	0.00	115.00	577.00	11,018.02	3370.00	1,434,720.00	51,635.00
yield (t/ha)	0.00	3.00	7.00	10.50	13.00	102.15	12.30

Table 5. Distribution of department–crop pairs by number of active periods (dataset of 969 pairs, 36 observed periods).

Active Periods Range	N Pairs	Percentage (%)
1–5	203	20.95
6–10	97	10.01
11–15	164	16.92
16–20	54	5.57
21–25	65	6.71
26–30	43	4.44
31–35	91	9.39
36 (complete)	252	26.01
Total	969	100.00

Table 6. Robustness analysis of the productive scale clustering solution: comparison of methods and scaling schemes.

Method/Scheme	Silhouette	ARI vs. Original	Cluster Sizes (L/M/S)	Large-Scale Pairs
Alternative clustering methods
K-Means (original)	0.888	—	5/26/459	5 (reference)
Hierarchical (Ward, $k = 3$)	0.865	0.728	5/44/441	5 (identical)
GMM ($k = 3$)	−0.073	−0.001	Degenerate solution
Scaling sensitivity (K-Means, $k = 3$)
Original (unscaled)	0.888	—	5/26/459	5
Log1p transformation	0.472	0.079	102/167/221	—
StandardScaler	0.508	0.053	9/108/373	—

L: large scale; M: medium scale; S: small scale. ARI: Adjusted Rand Index (1.0 = perfect agreement; 0.0 = random). GMM solution is degenerate due to violation of multivariate normality assumption. Scaling schemes alter cluster boundaries substantially (ARI < 0.08), confirming that normalization compresses the structural asymmetry the clustering is designed to capture.
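For readers unfamiliar with the Adjusted Rand Index used in Table 6, the sketch below implements the standard pair-counting definition in plain Python. It is an illustration of the metric, not the code behind the table; the function name is an assumption, and the toy labelings are invented for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare

    # Pairs of items grouped together in both labelings, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    maximum = (sum_a + sum_b) / 2          # agreement of a perfect match
    if maximum == expected:
        return 1.0  # degenerate case: both labelings put everything in one group
    return (sum_cells - expected) / (maximum - expected)


# Same partition with renamed clusters: agreement is perfect.
same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])
# Partitions that never group the same two items together: agreement is negative.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

Because the index depends only on which items are grouped together, cluster label names do not matter, which is why relabelled but identical partitions score 1.0.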

Table 7. Silhouette statistics by cluster for the K-Means model with $k = 3$.

Productive Scale	N Pairs	Min. Silhouette	Mean Silhouette	Max. Silhouette
Large	5	−0.474	0.553	0.640
Medium	26	−0.050	0.346	0.547
Small	459	−0.178	0.922	0.959
Global	490		0.888
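The silhouette values in Table 7 follow the usual definition $s(i) = (b_i - a_i) / \max(a_i, b_i)$, where $a_i$ is the mean distance from point $i$ to its own cluster and $b_i$ is the mean distance to the nearest other cluster. The sketch below is a plain-Python illustration on invented toy points, with Euclidean distance assumed. It also checks that the size-weighted mean of the per-cluster means in Table 7 matches the reported global value.

```python
from math import dist


def silhouette_values(points, labels):
    """Per-point silhouette coefficients using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            values.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy data (hypothetical): two tight, well-separated one-dimensional groups.
toy_points = [(0.0,), (1.0,), (10.0,), (11.0,)]
toy_labels = ["low", "low", "high", "high"]
toy_mean = sum(silhouette_values(toy_points, toy_labels)) / len(toy_points)

# Size-weighted mean of the per-cluster means in Table 7: (n pairs, mean silhouette).
table7 = {"Large": (5, 0.553), "Medium": (26, 0.346), "Small": (459, 0.922)}
global_mean = sum(n * s for n, s in table7.values()) / sum(n for n, _ in table7.values())
print(round(global_mean, 3))  # 0.888
```

The weighted check makes the structure of the global score visible: with 459 of 490 pairs in the small-scale cluster, the global mean is driven almost entirely by that cluster's cohesion.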

Table 8. Comparative characterization of the three productive scale clusters obtained from the K-Means model ($k = 3$), including descriptive statistics of average production and average sown_area per cluster.

Scale	N Pairs (%)	Production Mean (t)	Production Min (t)	Production Max (t)	CV (%)	Planted Area Mean (ha)	Avg. Periods
Large	5 (1.0%)	445,073.90	341,163.44	709,922.83	30.6	40,936.51	35.00
Medium	26 (5.3%)	69,873.57	35,925.69	210,753.66	54.0	13,543.21	34.81
Small	459 (93.7%)	2926.08	1.43	34,846.25	197.2	727.43	30.49
Natural breakpoints: ≈35,386 t (small/medium boundary) and ≈275,959 t (medium/large boundary).

CV: coefficient of variation of average production within each cluster.
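The natural breakpoints reported with Table 8 coincide with the midpoints between the maximum average production of one cluster and the minimum of the next. The source does not state how the breakpoints were computed, so the midpoint rule in this sketch is an observation about the published figures, not a description of the authors' procedure.

```python
# Production Min/Max (t) per cluster, copied from Table 8.
production_range = {
    "small": (1.43, 34846.25),
    "medium": (35925.69, 210753.66),
    "large": (341163.44, 709922.83),
}


def midpoint(lower_cluster, upper_cluster, ranges=production_range):
    """Midpoint between the top of one cluster and the bottom of the next."""
    lower_max = ranges[lower_cluster][1]
    upper_min = ranges[upper_cluster][0]
    return (lower_max + upper_min) / 2


small_medium = midpoint("small", "medium")
medium_large = midpoint("medium", "large")
print(round(small_medium), round(medium_large))  # 35386 275959
```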

Table 9. Five large-scale department–crop pairs with clustering variables.

Department	Crop	Avg. Production (t)	Avg. Planted Area (ha)	Avg. Yield (t/ha)	N Periods	Years
Cundinamarca	papa	709,922.83	33,907.14	22.00	35	2007–2024
Boyacá	papa	437,039.39	24,816.73	18.67	35	2007–2024
Casanare	arroz	375,134.74	75,772.73	5.45	35	2007–2024
Tolima	arroz	362,109.09	51,910.69	7.05	35	2007–2024
Nariño	papa	341,163.44	18,275.26	19.45	35	2007–2024
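The five averages in Table 9 are enough to reproduce the large-scale row of Table 8. In the sketch below, the mean and the coefficient of variation match the published 445,073.90 t and 30.6% when the population standard deviation (dividing by n) is used. That choice is inferred from the numbers, not stated in the source.

```python
from statistics import fmean, pstdev

# Avg. Production (t) of the five large-scale pairs, copied from Table 9.
large_scale = {
    ("Cundinamarca", "papa"): 709922.83,
    ("Boyacá", "papa"): 437039.39,
    ("Casanare", "arroz"): 375134.74,
    ("Tolima", "arroz"): 362109.09,
    ("Nariño", "papa"): 341163.44,
}

values = list(large_scale.values())
mean_production = fmean(values)

# Coefficient of variation in percent, using the population standard deviation
# (assumption: this is the convention that reproduces the value in Table 8).
cv_percent = 100 * pstdev(values) / mean_production

print(round(mean_production, 2), round(cv_percent, 1))  # 445073.9 30.6
```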

Table 10. Geographic distribution of department–crop pairs by productive scale.

Scale	N Pairs	N Departments	Leading Departments
Large	5	5	Cundinamarca, Boyacá, Nariño, Tolima, Casanare (1 pair each)
Medium	26	14	Meta (4), Boyacá (3), Norte de Santander (3), Bolívar (2), Huila (2), Antioquia (2), Córdoba (2), Cesar (2)
Small	459	32	Cundinamarca (37), Cauca (33), Boyacá (30), Nariño (29), Norte de Santander (28), Santander (27), Valle del Cauca (27)

Table 11. Distribution of department–crop pairs by productive scale and crop.

Scale	N Crops	Leading Crops (N Pairs)
Large	2	Papa (3: Cundinamarca, Boyacá, Nariño); arroz (2: Tolima, Casanare)
Medium	9	Arroz (7), maíz (7), papa (4), tomate (3); cebolla de bulbo, cebolla de rama, soya, patilla, zanahoria (1 each)
Small	56	Frijol (26), maíz (25), ahuyama (23), tomate (22), patilla (22), ají (20), others (301)

Table 12. Cross-tabulation between productive scale (clustering) and quartile-based classification.

Productive Scale	Q1 (Low)	Q2 (Med-Low)	Q3 (Med-High)	Q4 (Large)	Total
Large	0	0	0	5	5
Medium	0	0	0	26	26
Small	123	122	122	92	459
Total	123	122	122	123	490
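Table 12 can be checked and summarized with a few lines of arithmetic. The sketch below recomputes the row and column totals and the composition of the top quartile, using only the counts in the table. The dictionary layout and variable names are illustrative.

```python
# Counts copied from Table 12: productive scale (rows) by quartile (columns).
crosstab = {
    "Large": {"Q1": 0, "Q2": 0, "Q3": 0, "Q4": 5},
    "Medium": {"Q1": 0, "Q2": 0, "Q3": 0, "Q4": 26},
    "Small": {"Q1": 123, "Q2": 122, "Q3": 122, "Q4": 92},
}

row_totals = {scale: sum(row.values()) for scale, row in crosstab.items()}
col_totals = {
    q: sum(row[q] for row in crosstab.values()) for q in ("Q1", "Q2", "Q3", "Q4")
}

# Share of each productive scale inside the top quartile (Q4), in percent.
q4_composition = {
    scale: round(100 * row["Q4"] / col_totals["Q4"], 1)
    for scale, row in crosstab.items()
}
print(q4_composition)  # {'Large': 4.1, 'Medium': 21.1, 'Small': 74.8}
```

Read this way, about three quarters of the pairs that the quartile scheme places in its top group belong to the small-scale cluster, which is the mixing that the within-group coefficient of variation in the abstract quantifies.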


