Data Descriptor

Orange Leaves Images Dataset for the Detection of Huanglongbing

by Juan Carlos Torres-Galván 1, Paul Hernández Herrera 1, Juan Antonio Obispo 2, Xocoyotzin Guadalupe Ávila Cruz 2, Liliana Montserrat Camacho Ibarra 2, Paula Magaldi Morales Orosco 2, Alfonso Alba 1,3, Edgar R. Arce-Santana 1,3, Valdemar Arce-Guevara 1,3, J. S. Murguía 1,3, Edgar Guevara 1 and Miguel G. Ramírez-Elías 1,*

1 Facultad de Ciencias, Universidad Autónoma de San Luis Potosí, Av. Chapultepec 1570, Privadas del Pedregal, San Luis Potosí 78295, Mexico
2 Comité Estatal de Sanidad Vegetal de San Luis Potosí, Rioverde 796133, Mexico
3 Laboratorio Nacional-Centro de Investigación, Instrumentación e Imagenología Médica, Facultad de Ciencias, Universidad Autónoma de San Luis Potosí, Av. Chapultepec 1570, Privadas del Pedregal, San Luis Potosí 78295, Mexico
* Author to whom correspondence should be addressed.
Data 2025, 10(5), 56; https://doi.org/10.3390/data10050056
Submission received: 24 March 2025 / Revised: 18 April 2025 / Accepted: 22 April 2025 / Published: 23 April 2025

Abstract

In agriculture, the use of machine learning (ML) and deep learning (DL) has increased significantly in the last few years. ML and DL for image classification of plant diseases have generated significant interest due to their low cost, automation, scalability, and capacity for early detection. However, high-quality image datasets are required to train robust classifier models for plant disease detection. In this work, we have created an image dataset of 649 orange leaves divided into two groups: control (n = 379) and huanglongbing (HLB) disease (n = 270). The images were acquired with several high-resolution smartphone cameras and processed to remove the background. The dataset enriches the information on characteristics and symptoms of citrus leaves with HLB and healthy leaves. This makes the dataset potentially valuable for disease identification through leaf segmentation and abnormality detection, particularly when applying ML and DL models.
Dataset: DOI: 10.17632/jgkh2jxbwt.1. URL: https://data.mendeley.com/datasets/jgkh2jxbwt/1 (accessed on 22 April 2025)
Dataset Licence: CC-BY-4.0

1. Summary

One of the most challenging problems currently facing the global citrus industry is the Huanglongbing (HLB) disease, also known as citrus greening [1]. The disease is associated with three different variants of the Candidatus Liberibacter (Clas) [2,3,4]. Mexico is the fifth largest producer of oranges globally, with the region of San Luis Potosí representing the third most significant contributor within the country [5,6].
Infected trees display a range of leaf characteristics, including the formation of blotchy mottles, hardening of the leaves, growth in the form of rabbit ears, development of nutrient deficiency, and the formation of veins. These changes can result in fruit that is 1 cm smaller, lopsided, lighter in color, and exhibiting an inversion of color [2,4,7].
To the best of our knowledge, a limited number of publicly available datasets contain HLB-infected and healthy leaves, as shown in Table 1. Two of the most popular datasets are the Citrus Diseases image gallery [8] and the Plant Village Dataset [9]. There are other similar databases; however, those are not public or are focused on other diseases, thus are not included in the table [10,11].
The Citrus Diseases gallery has a total of 127 images, including 103 images of leaves affected by various citrus diseases and nutritional deficiencies such as canker, scab, citrus chlorotic dwarf virus, citrus stubborn disease, Tristeza virus, Mg-deficiency, N-deficiency, Zn-deficiency, and boron deficiency, and only 21 HLB images. This dataset does not include any healthy leaf images. The Plant Village Dataset has 5507 orange leaf images. Although extensive, it includes only images of HLB-infected leaves and does not contain healthy leaf images.
Rauf et al. [12] published a dataset of 609 leaf images, including 58 healthy and 551 infected with diseases such as Black spots, Canker, Scab, Greening, and Melanose; however, this dataset does not include HLB-infected leaves. Gómez-Flores et al. [13] published a 953-image dataset, including 100 healthy leaves images, 810 from twelve different nutritional deficiencies, and only 43 HLB-infected leaves. Additionally, two datasets are available on Kaggle. One dataset [14] includes images of 184 healthy leaves and 190 HLB-infected leaves. The other dataset (Roboflow repository) [15] has 646 healthy, 2069 black spots, 56 canker, and 285 HLB-infected leaves images. However, the Kaggle datasets do not provide details about the acquisition methodology and characteristics of the images. Compared with the available datasets, our dataset contains a large number of leaf images from both healthy leaves and those infected with HLB, including detailed specifications in the methodology of image acquisition. To the best of our knowledge, this is the most complete dataset acquired in Latin America. We believe this dataset will become a valuable resource for developing new machine-learning applications for plant disease detection.

2. Data Description

The dataset includes 649 orange leaves, which can be classified into two categories: HLB and Control. The first category (HLB) includes 270 images of leaves with symptoms of HLB.
The second category (Control) comprises 379 images of leaves without symptoms of HLB from healthy trees. These images serve as a control group for the dataset.
The dataset is presented in two formats: (1) raw data with a white background and (2) processed data, in which image standardization is used to minimize the influence of the background and ensure a consistent frame of reference for comparison. Figure 1 shows sample images of orange tree leaves from the dataset.

3. Methods

This study was conducted in the state of San Luis Potosí, Mexico (Ciudad Fernández and Rioverde), where orange growing, the most important crop in the region, accounts for 4.7% of the state's production value [10].
With the assistance of the technical experts of the Plant Health Committee, the orchards were delineated to collect a sample of the orange leaves. A sample was obtained from each orchard, with one sample acquired for each hectare of orchard. Before this, the category to which each tree belonged was identified by utilizing the information recorded in the orchard and the data previously obtained from producers and the same committee. Special attention was paid to ensuring that the trees were free of any other diseases that could affect them.
Once a tree was selected, the leaf was cut and stored in a zip lock bag to preserve it in good condition. The leaf was then photographed in a controlled environment, with the camera positioned in front of a white sheet of paper to limit external elements’ influence, as shown in Figure 2. The photographs were acquired using cameras on several mobile phones, as described in Table 2.
Image standardization
To minimize the influence of the background in the photos and ensure a consistent frame of reference for comparison, we standardized the images using the methodology outlined in Figure 3, which consists of seven main steps:
Step 1 (convert RGB to HSV): The RGB color model combines red, green, and blue components to define a color. However, using this representation directly for leaf segmentation can be challenging due to how these components interact, especially under varying lighting conditions. In contrast, in the HSV color model, the hue channel represents the color itself (Figure 3: Step 1 Hue Channel), while the value channel represents brightness (Figure 3: Step 1 Value Channel). This separation makes it easier to distinguish objects based on color, regardless of lighting variations.
Step 2 (color-based segmentation): From visual inspection, the Hue and Value (Brightness) channels are effective for background removal, as the background typically exhibits higher Hue and Value values than the leaf. This is due to the leaf’s pigmentation (yellow and green) corresponding to Hue values in the range [30,130], while the background is generally brighter. Based on this observation, an Otsu thresholding method [11] was applied to automatically determine segmentation thresholds for each channel. The segmentation is defined as
$$
\mathrm{mask}(x,y) =
\begin{cases}
1, & \text{if } \mathrm{Hue}(x,y) > 30 \text{ and } \mathrm{Hue}(x,y) < T_{hue} \text{ and } \mathrm{Value}(x,y) < T_{value} \\
0, & \text{otherwise}
\end{cases}
$$

$$
\mathrm{segmented\_image}(x,y) = \mathrm{RGB}(x,y) \cdot \mathrm{mask}(x,y)
$$

where Hue and Value are the respective channels, and $T_{hue}$ and $T_{value}$ are the thresholds obtained via Otsu's method.
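The segmentation rule above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch on a tiny synthetic image, not the authors' implementation: the hue scale (degrees, 0 to 360), the value scale (0 to 255), the histogram-free form of Otsu's criterion, and the names `otsu_threshold` and `segment_leaf` are all assumptions made for the example.

```python
import colorsys


def otsu_threshold(values):
    """Return the cut that maximizes the between-class variance (Otsu's criterion).

    Histogram-free variant for small inputs: candidate cuts are the midpoints
    between consecutive distinct values.
    """
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    best_cut, best_score = xs[0], -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]
        if xs[i] == xs[i - 1]:
            continue  # only cut between distinct values
        w0, w1 = i, n - i
        m0, m1 = left_sum / w0, (total - left_sum) / w1
        score = w0 * w1 * (m0 - m1) ** 2  # proportional to between-class variance
        if score > best_score:
            best_score, best_cut = score, (xs[i - 1] + xs[i]) / 2
    return best_cut


def segment_leaf(rgb, hue_min=30.0):
    """rgb: rows of (r, g, b) tuples in 0..255. Returns (mask, segmented image)."""
    hue, val = [], []
    for row in rgb:
        hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in row]
        hue.append([h * 360.0 for h, _, _ in hsv])  # assumed scale: degrees
        val.append([v * 255.0 for _, _, v in hsv])  # assumed scale: 0..255
    t_hue = otsu_threshold([h for row in hue for h in row])
    t_val = otsu_threshold([v for row in val for v in row])
    mask = [[1 if hue_min < h < t_hue and v < t_val else 0
             for h, v in zip(hrow, vrow)]
            for hrow, vrow in zip(hue, val)]
    # Elementwise product RGB(x, y) * mask(x, y): background pixels become zero.
    segmented = [[tuple(c * m for c in px) for px, m in zip(prow, mrow)]
                 for prow, mrow in zip(rgb, mask)]
    return mask, segmented


# Synthetic 6x6 image: bright, slightly bluish background with a green/yellow "leaf".
BG, GREEN, YELLOW = (235, 240, 250), (60, 160, 40), (180, 190, 50)
image = [[BG] * 6 for _ in range(6)]
for r in (2, 3):
    for c in range(1, 5):
        image[r][c] = GREEN
image[2][1] = image[3][4] = YELLOW

mask, segmented = segment_leaf(image)
for row in mask:
    print(row)
```

On real photographs, an image-processing library would normally replace the hand-written threshold search; the sketch only makes the three conditions of the mask explicit.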
Figure 3 (Step 2) illustrates an example of this segmentation, demonstrating effective background removal. However, in some cases, small background structures remain detectable, or the thresholds may cause under-segmentation, resulting in missing portions of the leaf. Then, to address these issues, manual threshold adjustments were made in case of segmentation errors, incorporating the Saturation channel to enhance segmentation control. Finally, the largest connected component is selected as the leaf from the raw segmentation.
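Selecting the largest connected component can be illustrated with a breadth-first search over the binary mask. The sketch below assumes 4-connectivity and a mask stored as nested lists; the source does not state which connectivity or library was used, and `largest_component` is a hypothetical helper name.

```python
from collections import deque


def largest_component(mask):
    """Keep only the largest 4-connected region of 1s in a binary mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collecting one connected region.
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    out = [[0] * w for _ in range(h)]
    for y, x in best:
        out[y][x] = 1
    return out


# Raw segmentation with two isolated background specks at the corners.
raw = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
leaf = largest_component(raw)
for row in leaf:
    print(row)
```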
Step 3 (fill holes): The raw segmentation may contain holes due to variability in color and brightness. To address this, an algorithm is applied to fill the holes, ensuring the segmentation encompasses the entire leaf. Figure 3 (Step 3) illustrates an example where the holes in the raw segmentation are corrected, resulting in a complete leaf segmentation.
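The text does not name the hole-filling algorithm. One common approach, shown here purely as an assumed illustration, is to flood-fill the background from the image border and treat every background pixel that was not reached as a hole; `fill_holes` is a hypothetical helper name.

```python
from collections import deque


def fill_holes(mask):
    """Fill background regions that are not connected to the image border."""
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the flood fill with every background pixel on the border.
    for r in range(h):
        for c in range(w):
            on_border = r in (0, h - 1) or c in (0, w - 1)
            if on_border and mask[r][c] == 0:
                outside[r][c] = True
                queue.append((r, c))
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w
                    and mask[ny][nx] == 0 and not outside[ny][nx]):
                outside[ny][nx] = True
                queue.append((ny, nx))
    # Everything not reachable from the border is leaf (original 1s plus holes).
    return [[0 if outside[r][c] else 1 for c in range(w)] for r in range(h)]


raw = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],  # the 0 in the middle is a hole inside the leaf
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_holes(raw)
for row in filled:
    print(row)
```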
Step 4 (Leaf direction): Principal Component Analysis (PCA) is a statistical technique used to identify orthogonal vectors that capture most of the variance (information) in the data. It is commonly applied to reduce data dimensionality while preserving variance. In this study, PCA is used to determine the direction of maximum variability of the leaf, which corresponds to its orientation. To achieve this, the coordinates (x,y) of each pixel within the segmentation are extracted, and PCA is applied. The first principal component represents the primary direction of the leaf's variability, which is taken as its main orientation. Additionally, the mean of the coordinates in each dimension is taken as the center of the leaf. Figure 3 (Step 4) illustrates the detected center of the leaf. The blue arrow represents the direction of the first principal component (main direction), while the second arrow corresponds to the second principal component. This approach effectively and accurately identifies the leaf's orientation.
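For two-dimensional pixel coordinates, the first principal component has a closed form, so the idea of Step 4 can be sketched without a linear-algebra library. The function name `leaf_orientation` and the nested-list mask are assumptions for illustration; the angle is expressed in the same coordinate frame as the pixel indices (x to the right, y along increasing row index).

```python
import math


def leaf_orientation(mask):
    """Return ((cx, cy), angle_deg) for the pixels set to 1 in a binary mask.

    The angle is that of the first principal component relative to the x-axis,
    computed from the 2x2 covariance matrix of the pixel coordinates.
    """
    pts = [(x, y) for y, row in enumerate(mask) for x, m in enumerate(row) if m]
    n = len(pts)
    cx = sum(x for x, _ in pts) / n  # mean of x: leaf center, first coordinate
    cy = sum(y for _, y in pts) / n  # mean of y: leaf center, second coordinate
    sxx = sum((x - cx) ** 2 for x, _ in pts) / n
    syy = sum((y - cy) ** 2 for _, y in pts) / n
    sxy = sum((x - cx) * (y - cy) for x, y in pts) / n
    # Closed-form direction of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (cx, cy), math.degrees(theta)


# A thin diagonal "leaf": pixels (0,0), (1,1), ..., (4,4).
diag = [[1 if r == c else 0 for c in range(5)] for r in range(5)]
center, angle = leaf_orientation(diag)
print(center, angle)
```

For a horizontally elongated mask the same function gives an angle of zero, which is the configuration the later alignment step aims for.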
Step 5 (RGB background removal): To remove the background from the RGB image, the segmentation mask obtained in Step 3 (with intensity values of 1 for the leaf and 0 for the background) is multiplied elementwise with the original RGB image. This operation retains only the leaf region while setting the background to zero. Figure 3 (Step 5) shows the result, where the background has been successfully removed from the original RGB image.
Step 6 (Align principal component with the x-axis): The primary principal component (blue arrow) forms an angle θ with respect to the x-axis. To align it, the RGB image (with the background removed) is rotated clockwise by the angle θ around the center determined in Step 4. The result is an image where the principal component (indicated by a blue arrow) aligns with the x-axis, as shown in Figure 3 (Step 6).
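The geometry of this alignment can be checked on pixel coordinates rather than on a full image. The sketch below rotates points by −θ about the center, in the same coordinate frame in which θ was measured; whether that appears clockwise or counter-clockwise on screen depends on the direction of the y-axis. The helper `rotate_points` is hypothetical, and resampling of pixel values is deliberately left out.

```python
import math


def rotate_points(points, center, theta_deg):
    """Rotate (x, y) points by -theta about center.

    A direction that makes an angle theta with the x-axis is mapped onto
    the x-axis itself.
    """
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    cx, cy = center
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + cos_t * dx + sin_t * dy,
                    cy - sin_t * dx + cos_t * dy))
    return out


# Diagonal leaf pixels with center (2, 2) and principal angle 45 degrees.
pts = [(i, i) for i in range(5)]
aligned = rotate_points(pts, (2.0, 2.0), 45.0)
for x, y in aligned:
    print(round(x, 3), round(y, 3))
```

After the rotation, all points share the y-coordinate of the center, so the leaf's main direction lies along the x-axis.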
Step 7 (Final image): The final step involves detecting the bounding box that encloses only the leaf. In Figure 3 (Step 6), the bounding box is represented by a yellow rectangle. The final cropped image is presented in Figure 3 (Step 7).
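The final crop amounts to finding the first and last rows and columns that contain leaf pixels. A minimal sketch, assuming the image and mask are nested lists of equal size and using the hypothetical name `crop_to_bounding_box`:

```python
def crop_to_bounding_box(image, mask):
    """Crop image to the tightest axis-aligned box containing all mask pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty: no leaf pixels to crop around")
    top, bottom, left, right = rows[0], rows[-1], cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# 5x6 image whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(6)] for r in range(5)]
mask = [[1 if 1 <= r <= 2 and 2 <= c <= 4 else 0 for c in range(6)]
        for r in range(5)]
cropped = crop_to_bounding_box(image, mask)
for row in cropped:
    print(row)
```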

4. Conclusions

The dataset described in this paper includes 649 images of orange leaves divided into HLB-infected (n = 270) and healthy (n = 379). This dataset addresses a gap in HLB research by providing standardized, background-removed images of orange leaves from Mexico. It is the second largest public database for HLB detection that includes images of healthy leaves. This dataset can contribute to training new machine learning or deep learning classifiers for HLB detection, particularly in the early stages. It also enables comparative studies of HLB based on geographic region. The image standardization procedure can also contribute to the preprocessing pipeline for future HLB image analysis.
Although our dataset includes a small number of HLB images compared to other public datasets, the fact that it includes healthy images, balanced classes, a controlled acquisition process, and a preprocessing stage increases its reliability for the development of classifier models. The variability of image acquisition across multiple smartphone cameras, together with a high resolution compared with other similar databases, enhances its potential for applications in real-world scenarios. In addition to classification tasks, this dataset can be used for applications such as segmentation for leaf morphology analysis, detecting anomalies associated with HLB symptoms or nutritional deficiency, and disease progression monitoring to complement diagnosis. Moreover, our dataset will be expanded as new images become available.
The authors recognize that the most accurate method for determining the presence of HLB is quantitative real-time polymerase chain reaction (qPCR). However, the cost associated with the synthesis of probes for 649 distinct trees is prohibitively high, limiting the feasibility of this approach. For this reason, we collaborated with the Plant Health Committee, which provides technical experts who monitor pests and diseases in the region daily and who know the areas where HLB is present and the areas where the trees are healthy. Most plants have been diagnosed by the government secretariat focused on disease control. For those plants that were not diagnosed by qPCR, the experts were confident in indicating which ones had symptoms of HLB and which were healthy. A further limitation of this study is that the images could not be acquired in situ due to the influence of external factors such as sunlight, shadows, and brightness, which affected their quality.
Future work will include expanding the dataset with in-field images and qPCR-validated samples, integrating multispectral or hyperspectral imaging for multimodal analysis, and developing classifier models optimized for mobile environments toward a smartphone-based acquisition methodology.
In summary, our dataset addresses an important gap in HLB disease research by offering more than 250 high-resolution images of each class, acquired with a standardized procedure, as a foundation for both academic and applied model innovations in agricultural diagnostics.

Author Contributions

Conceptualization: J.C.T.-G. and M.G.R.-E.; Data curation: J.C.T.-G., J.A.O., X.G.Á.C., L.M.C.I. and P.M.M.O.; Formal analysis: J.C.T.-G. and P.H.H.; Funding acquisition: J.C.T.-G. and M.G.R.-E.; Investigation: J.C.T.-G., M.G.R.-E. and P.H.H.; Methodology: J.C.T.-G. and P.H.H.; Project administration: J.C.T.-G.; Resources: J.C.T.-G., J.A.O., X.G.Á.C., L.M.C.I. and P.M.M.O.; Software: J.C.T.-G. and P.H.H.; Supervision: J.C.T.-G. and M.G.R.-E.; Validation: P.H.H.; Visualization: J.C.T.-G., M.G.R.-E. and P.H.H.; Writing—original draft: J.C.T.-G.; Writing—review and editing: J.C.T.-G., P.H.H., V.A.-G., M.G.R.-E., E.G., J.S.M., E.R.A.-S. and A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been made possible partially by grant number 2023-329644 from the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation. The research also receives funding from the “Consejo Nacional de Humanidades, Ciencias y Tecnologías” (CONAHCYT) postdoctoral fellowship 4630373 and for the National System of Researchers (SNII) under 346243.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Repository name: Orange Leaves Images Dataset for the Detection of Huanglongbing. Data identification number: DOI: 10.17632/jgkh2jxbwt.1. Direct URL to data: https://data.mendeley.com/datasets/jgkh2jxbwt/1 (accessed on 22 April 2025).

Acknowledgments

The authors would like to express their gratitude to “Secretaría de Desarrollo Agropecuario y Recursos Hidráulicos (SEDARH)”, especially to Noel Isaí Pérez Robles, and “Comité Estatal de Sanidad Vegetal of San Luis Potosí” for their invaluable support. This work has been made possible partially by grant number 2023-329644 from the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation. JCTG also extends acknowledgments to “Consejo Nacional de Humanidades, Ciencias y Tecnologías” (CONAHCYT) for postdoctoral fellowship 4630373 and for the National System of Researchers (SNII) under 346243.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Meaning |
|---|---|
| HLB | Huanglongbing |
| ML | Machine Learning |
| DL | Deep Learning |
| Clas | Candidatus Liberibacter |
| RGB | Red Green Blue |
| HSV | Hue Saturation Value |
| PCA | Principal Component Analysis |
| qPCR | Quantitative real-time polymerase chain reaction |

References

  1. Silva, J.R.D.; Boaretto, R.M.; Lavorenti, J.A.L.; dos Santos, B.C.F.; Coletta-Filho, H.D.; Mattos, D. Effects of Deficit Irrigation and Huanglongbing on Sweet Orange Trees. Front. Plant Sci. 2021, 12, 731314.
  2. EPPO Global Database. Available online: https://gd.eppo.int/ (accessed on 21 June 2024).
  3. Home. Available online: https://www.cabi.org/ (accessed on 21 June 2024).
  4. Bové, J.M. Huanglongbing: A Destructive, Newly-Emerging, Century-Old Disease of Citrus. J. Plant Pathol. 2006, 88, 7–37.
  5. Producción de Cítricos en México. Available online: http://www.gob.mx/publicaciones/articulos/produccion-de-citricos-en-mexico?idiom=es (accessed on 2 July 2024).
  6. Secretaría de Agricultura y Desarrollo Rural. México, Quinto Productor Mundial de Naranja. Available online: http://www.gob.mx/agricultura/articulos/mexico-quinto-productor-mundial-de-naranja (accessed on 3 November 2024).
  7. Floyd, J.; Krass, C. New Pest Response Guidelines: Citrus Greening Disease; Animal and Plant Health Inspection Service: Riverdale, MD, USA, 2008.
  8. Citrus Diseases Image Gallery. Available online: https://idtools.org/citrus_diseases/ (accessed on 20 June 2024).
  9. PlantVillage Dataset. Available online: https://www.kaggle.com/datasets/emmarex/plantdisease (accessed on 5 July 2024).
  10. Syed-Ab-Rahman, S.F.; Hesamian, M.H.; Prasad, M. Citrus disease detection and classification using end-to-end anchor-based deep learning model. Appl. Intell. 2022, 52, 927–938.
  11. Qiu, R.-Z.; Chen, S.-P.; Chi, M.-X.; Wang, R.-B.; Huang, T.; Fan, G.-C.; Zhao, J.; Weng, Q.-Y. An automatic identification system for citrus greening disease (Huanglongbing) using a YOLO convolutional neural network. Front. Plant Sci. 2022, 13, 1002606.
  12. Rauf, H.T.; Saleem, B.A.; Lali, M.I.U.; Khan, M.A.; Sharif, M.; Bukhari, S.A.C. A citrus fruits and leaves dataset for detection and classification of citrus diseases through machine learning. Data Brief 2019, 26, 104340.
  13. Gómez-Flores, W.; Garza-Saldaña, J.J.; Varela-Fuentes, S.E. CitrusUAT: A dataset of orange Citrus sinensis leaves for abnormality detection using image analysis techniques. Data Brief 2024, 52, 109908.
  14. Citrus Leaves Images Divided in Huanglongbing (HLB) Infected and Healthy. Available online: https://www.kaggle.com/datasets/oarcanjomiguel/citrus-greening (accessed on 5 July 2024).
  15. Dimitra. Citrus Dataset. Available online: https://universe.roboflow.com/dimitra-el1gk/citrus-dataset-hvhep (accessed on 5 July 2024).
Figure 1. Images from the database. The images of the top row correspond to leaves with symptoms of HLB, while the images of the bottom row are from healthy leaves of orange trees. The images of panels (a,c,e,g) are taken with the camera of different smartphones, while panels (b,d,f,h) are from the processed data with a standardization that minimizes the influence of the background.
Figure 2. Flow chart of the database acquisition with leaves of orange trees showing symptoms of HLB and those that are healthy.
Figure 3. Workflow illustrating the seven main steps for image standardization: (1) convert the image from the RGB color model to HSV, (2) segment the leaf using a thresholding method applied to each HSV channel, (3) fill holes to ensure the entire leaf is detected, (4) identify the principal direction of the leaf, (5) remove the background from the RGB image using the segmentation, (6) align the leaf with the x- and y-axes based on its principal direction, and (7) crop the image to a bounding box that tightly encloses the leaf.
Table 1. Public databases containing HLB and healthy leaf images. Only the numbers of images in the HLB and healthy classes are listed for each dataset, even where additional categories exist.

| Database Name/Author | Reference | HLB Images | Healthy Images | Total Images | Minimum Resolution | Maximum Resolution |
|---|---|---|---|---|---|---|
| Citrus Diseases Image Gallery | [8] | 21 | 0 | 127 | 496 × 397 | 1882 × 1201 |
| Plant Village Dataset | [9] | 5507 | 0 | 5507 | 256 × 256 | 256 × 256 |
| Rauf et al. | [12] | 204 | 58 | 262 | 256 × 256 | 256 × 256 |
| Gómez-Flores et al. | [13] | 43 | 100 | 143 | 4128 × 3096 | 4128 × 3096 |
| Citrus leaves images divided in Huanglongbing (HLB) infected and healthy | [14] | 190 | 184 | 374 | 720 × 1280 | 4032 × 2268 |
| Roboflow repository | [15] | 285 | 646 | 931 | 640 × 640 | 640 × 640 |
| Our dataset | This work | 270 | 379 | 649 | 1800 × 4000 | 4624 × 3468 |
Table 2. Smartphone, images, and camera features. For each cellphone used to capture the leaf images, the table lists the camera characteristics and the number of images acquired, along with the mean and standard deviation of contrast and percentage of luminescence.

| Cellphone Used | Camera Features | Healthy Images | HLB Images | Total Images | Image Resolution | Contrast | Luminescence (%) |
|---|---|---|---|---|---|---|---|
| iPhone 13 | 12 MP, ƒ/1.6 aperture | 4 | 50 | 54 | 1800 × 4000 | 45.77 ± 1.531 | 69.26 ± 0.25 |
| Motorola Edge 40 Neo | 50 MP wide-angle, ƒ/1.8 aperture | 203 | 32 | 235 | 4624 × 3468 | 55.54 ± 9.078 | 65.0 ± 5.42 |
| Xiaomi Poco C65 | 50 MP, ƒ/1.8 aperture | 21 | 77 | 98 | 4624 × 3468 | 53.59 ± 11.53 | 68.71 ± 3.91 |
| Samsung Galaxy A32 | 64 MP, ƒ/1.8 | 100 | 61 | 161 | 2084 × 4624 | 49.95 ± 12.17 | 67.58 ± 5.06 |
| Samsung Galaxy A52 | 64 MP, ƒ/1.8 | 51 | 50 | 101 | 3468 × 4624 | 39.90 ± 13.857 | 71.87 ± 4.81 |
| Total | | 379 | 270 | 649 | - | - | - |
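As a quick consistency check on Table 2, the per-phone counts can be summed and compared with the class totals reported for the dataset. The snippet below is only an illustrative sketch; the dictionary simply transcribes the healthy and HLB counts from the table, and the variable names are chosen for the example.

```python
# Per-phone image counts (healthy, HLB), transcribed from Table 2.
counts = {
    "iPhone 13": (4, 50),
    "Motorola Edge 40 Neo": (203, 32),
    "Xiaomi Poco C65": (21, 77),
    "Samsung Galaxy A32": (100, 61),
    "Samsung Galaxy A52": (51, 50),
}

healthy_total = sum(h for h, _ in counts.values())
hlb_total = sum(d for _, d in counts.values())
total = healthy_total + hlb_total

# Share of HLB images among the images taken with each phone.
hlb_share = {phone: d / (h + d) for phone, (h, d) in counts.items()}

print(healthy_total, hlb_total, total)
print(f"HLB fraction overall: {hlb_total / total:.1%}")
for phone, share in hlb_share.items():
    print(f"{phone}: {share:.1%} HLB")
```

The per-phone HLB share differs considerably between devices, so it may be worth taking the acquisition device into account when building training and test splits.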