Article

Quantitative Analysis of the Labeling Quality of Biological Images for Semantic Segmentation Based on Attribute Agreement Analysis

1 College of Quality and Standardization, China Jiliang University, Hangzhou 310018, China
2 College of Biosystems Engineering and Food Science, Zhejiang University, Hangzhou 310058, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(7), 680; https://doi.org/10.3390/agriculture15070680
Submission received: 20 February 2025 / Revised: 14 March 2025 / Accepted: 21 March 2025 / Published: 22 March 2025
(This article belongs to the Section Digital Agriculture)

Abstract

Semantic segmentation in biological images is increasingly common, particularly in smart agriculture, where deep learning model precision is tied to image labeling quality. However, research has largely focused on improving models rather than analyzing image labeling quality. We proposed a method for quantitatively assessing labeling quality in semantically segmented biological images using attribute agreement analysis. This method evaluates labeling variation, including internal, external, and overall labeling quality, and labeling bias between labeling results and standards through case studies of tomato stem and group-reared pig images, which vary in labeling complexity. The process involves the following three steps: confusion matrix calculation, Kappa value determination, and labeling quality assessment. Initially, two labeling workers were randomly selected to label ten images from each category twice, according to the requirements of the attribute agreement analysis method. Confusion matrices for each image’s dual labeling results were calculated, followed by Kappa value computation. Finally, labeling quality was evaluated by comparing Kappa values against quality criteria. We also introduced a contour ring method to enhance Kappa value differentiation in imbalanced sample scenarios. Three types of representative images were used to test the performance of the proposed method. The results show that attribute agreement analysis effectively quantifies image labeling quality, and the contour ring method improves Kappa value differentiation. The attribute agreement analysis method allows for quantitative analysis of labeling quality based on image labeling difficulty, and Kappa values can also be used as a metric of image labeling difficulty. Dynamic analysis of image labeling variations over time needs further research.


1. Introduction

In the field of computer vision, image semantic segmentation is a significant area of research. Semantic segmentation involves accurately segmenting different categories of objects in an image by classifying the image at the pixel level and assigning each pixel a corresponding semantic category. Image semantic segmentation has diverse applications across various fields, including healthcare [1], autonomous driving [2], aerospace [3], and geography [4]. Moreover, the ongoing advancement of semantic segmentation has enabled its application in addressing complex bioengineering challenges [5]. For instance, through the semantic segmentation of plant stems and fruits, it can be utilized in real-time supervision of crop growth status [6,7], disease and pest identification [8,9], crop yield prediction [10], automatic pruning [11], and automatic picking [12]. Another example is the semantic segmentation of individual animals, which proves valuable in animal body size monitoring [13], animal behavior recognition [14,15], animal disease diagnosis [16], and animal population statistics [17].
The performance of the biological image semantic segmentation deep learning model depends not only on the deep learning model itself, but also on the data quality of training sets. Training deep learning models for semantic image segmentation requires a large number of labeled images. Image labeling, particularly for semantic segmentation, is a tedious and labor-intensive task. To enhance the efficiency of image labeling, large-scale image labeling projects typically involve multiple labeling workers. Unfortunately, different labeling workers have varying interpretations of labeling rules, resulting in discrepancies in the labeling results. Even when the same labeling worker labels the same image multiple times, the labeling results are not consistent. These differences in image labeling results, whether by different workers or the same worker at different times, lead to inconsistent image label quality. In the context of deep learning, image labeling is closely tied to model training quality [18]. However, the current deep learning modeling of biological images mainly focuses on the deep learning model design, ignoring the quality of the datasets used for training. This poses a challenge for researchers and practitioners working with deep learning models for semantic segmentation, particularly in crowdsourcing companies. Therefore, the development of analysis methods for image labeling quality is crucial for ensuring consistent labeling results and improving the training accuracy of deep learning models.
Several studies have been conducted on the analysis of image labeling quality. Xu et al. [19] proposed a framework for deriving high-quality segmentation data and denoising labels by using a small amount of high-quality labels together with a large amount of low-quality noisy labels; label quality was differentiated manually according to worker category. Zhang et al. [20] introduced a framework for improving the quality of crowdsourced labels through noise correction, integrating ground truth inference with a novel Adaptive Voting Noise Correction (AVNC) algorithm that identifies and corrects mislabeled data using ensemble learning models. Tao et al. [21] addressed the issue of varying labeling quality among workers and proposed a method that combines each worker's specific quality on individual instances with their overall quality across all instances to estimate per-worker weights, which are then used to aggregate labels into more accurate true labels. The label weights were determined from label similarity among workers, but neither the internal labeling agreement of each worker nor the influence of random factors was considered. Moreover, the overall quality of a worker was defined as the classification accuracy of a classifier trained on a single-label dataset composed of training instances' feature vectors and the corresponding crowd labels; forming this dataset and training the classifier is time consuming. Pičuljan et al. [22] developed a machine learning-based method to distinguish high-quality labels from low-quality ones and help labeling workers focus on potentially mislabeled images; training this method is also time consuming, and its accuracy is only 82%. Wang et al. [23,24] provided a conservative method that eliminates low-quality workers without eliminating any non-low-quality workers, and an aggressive method that eliminates all low-quality workers by majority voting. These methods select high-quality labels but require every image to be labeled by more than two workers simultaneously, which is time consuming; the resulting labeling quality also depends partly on chance, and overall labeling quality is not considered. They further constructed a small model to estimate task difficulty and improved label quality by assigning workers of different labeling quality to projects of different difficulty, but labeling quality was still classified manually into only two worker categories. Overall, existing research has focused primarily on identifying and improving label quality, without a comprehensive quantitative analysis of the labeling work itself, and has mainly targeted image classification or object detection tasks.
To address the problem of quantitatively analyzing the labeling quality of biological images for semantic segmentation, we proposed a quantitative analysis method based on attribute agreement analysis, which includes labeling variation analysis and labeling bias analysis. The main objectives of this method were the following: (1) to achieve quantitative analysis of labeling variation, including the internal image labeling quality of individual labeling workers, the external and overall image labeling quality of multiple labeling workers, and the labeling bias between labeling results and standards; (2) to solve the issue of low differentiation in the evaluation of image labeling quality caused by an imbalance of positive and negative samples. This study aims to provide insights for improving image labeling quality and is significant for enhancing the overall quality of image labeling.

2. Materials and Methods

2.1. Image Dataset

2.1.1. Biological Images for Semantic Segmentation

The images used for labeling variation analysis in this study were labeled images of tomato plant stems grown in greenhouses and pig individuals reared in herds. The tomato stem profile is slender and topologically complex, with varying diameters and some missing parts due to branch and leaf shading. Labeling these images for semantic segmentation is challenging, with poor agreement and unstable labeling quality. On the other hand, group-reared pigs have larger body sizes and smaller individual differences, making them relatively easy to label for semantic segmentation. The agreement of the labeling results for pig images is better, resulting in more stable labeling quality. These two types of images were selected for image labeling quality analysis because they are representative and facilitate the validation of analytical methods for image labeling quality. According to the requirements of the attribute agreement analysis [25,26,27], ten tomato stem images and ten group-reared pig images were chosen for this study. The tomato stem images vary in distance, morphology, occlusion, and the number of stems. The group-reared pig images vary in posture and the number of pigs.
To analyze labeling bias, bird images were used, which were accessed from the publicly available dataset released at the 2016 IEEE International Conference on Image Processing (ICIP) [28]. Labels in the dataset were treated as labeling standards used in the labeling bias analysis. Sample images are provided in Figure 1.

2.1.2. Image Labeling

(1) Image labeling tools
The image labeling tool used in this study is LabelMe 1.8.1, developed by the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology (MIT) [29]. LabelMe is a tool for image labeling and dataset creation. It allows the labeling of semantically segmented objects in images using polygons, rectangles, circles, polylines, line segments, and points. The LabelMe interface is user-friendly, providing an intuitive interface for image viewing and labeling [30]. The labeled images can be saved in various formats, such as XML and JSON; in this study, the labeling results were saved in JSON format.
(2) Image labeling process and requirements
According to the requirements of the attribute agreement analysis [25,26,27], two labeling workers, A and B, were selected, and each labeling worker performed two rounds of repeated labeling of the same image to meet the repeatability requirement. The image labeling requirements are as follows: during the image labeling period, the two labeling workers do not discuss image labeling results and are asked to perform pixel-level labeling independently according to the predetermined image labeling rules. To ensure blind labeling, three months passed between the first and second labeling sessions so that the workers would not recall their previous labeling results.
The predetermined image labeling rules are as follows: for tomato stem images, only clear tomato stems are labeled, while unclear tomato stems are not labeled. Only the main tomato stems and their side branches are labeled, excluding stems that are occluded by branches, leaves, or obstacles. For images of group-reared pigs, pigs that appear within the captive area in the field of view are labeled, while pigs appearing outside the enclosure are not labeled. The labeling rules for bird images are the same as those for group-reared pigs.
In practical applications, the three-month interval between the first and second labeling sessions is too long for immediate evaluation. Because image datasets are large, labeling workers usually cannot recall specific images. Therefore, the blind labeling requirement can be met through a marking and mixing method. The process involves selecting 10 representative images as sample images and creating duplicate copies. These 20 images are marked and mixed into the image dataset for labeling. The workers then label the images according to the predetermined image labeling rules. At the end of the labeling process, the 20 marked images are retrieved for evaluation.

2.2. Programming Tools and Environment

The computer system consists of an Intel i5-12400F processor (Intel Corporation, Santa Clara, CA, USA), a GeForce RTX 3060 graphics card (GALAXY Technology, Hong Kong, China), and a Windows 10 64-bit operating system. CUDA version 10.0 was utilized, and the image labeling quality analysis program was developed in Python version 3.10.

2.3. Measurement System Analysis

To quantitatively analyze the labeling quality of the biological image for semantic segmentation, Measurement System Analysis (MSA), which is one of the five key tools in quality engineering, was applied, and a Kappa-based attribute agreement analysis method was proposed.
MSA primarily addresses issues related to the quality of measurement data and can analyze the quality of both variables and attributes measurement data. Variables measurement data refer to the measurement data whose values change continuously, while attributes measurement data refer to the data with discrete changes in values. Semantic segmentation image labeling data, where different labeling workers visually classify pixels as either foreground or background, fall into the category of attributes measurement data. Attributes measurement data can be further divided into categorical and ordinal data. Categorical data have no natural order among categories, while ordinal data have a natural order. Since the foreground and background in image labeling do not have a natural order, semantic segmentation image labeling data are categorical data. The quality analysis of categorical data commonly employs the Kappa value method.
In MSA, bias, repeatability, and reproducibility are key to evaluating measurement data quality [31,32]. In this study, labeling bias refers to the difference between labeling results and standards. Repeatability refers to the variation among repeated labeling results of the same image by the same labeling worker, measuring the labeling variation caused by an individual labeling worker. Reproducibility refers to the variation of labeling results among different labeling workers, measuring the labeling variation caused by different labeling workers. Repeatability and reproducibility together constitute the labeling variation. Image labeling was treated as a kind of measurement, where the task is to visually determine whether a pixel belongs to the foreground or background. The labeling workers, labeling tool, labeled images, labeling method, and labeling environment are considered components of an attribute measurement system, as shown in Figure 2. The Kappa value was then used to quantitatively analyze the quality of the measurement data, i.e., the image labeling results, focusing on variations in the labeling results of individual labeling workers, different labeling workers, and all labeling workers.

2.4. Quantitative Analysis Method for Labeling Variation

The method used to analyze the labeling quality of biological images for semantic segmentation involves obtaining two rounds of labeling results from two labeling workers for each image, yielding four labeling results per image. The internal, external, and overall labeling quality of the images was analyzed based on attribute agreement analysis. KA and KB, KAB, and KA,B were used to analyze the within-worker agreement, the between-worker agreement, and the overall labeling agreement, respectively; details are given in Section 2.4.2, Section 2.4.3 and Section 2.4.4. The analysis process consists of three steps: confusion matrix calculation, Kappa value calculation, and quantitative analysis of image labeling quality based on Kappa values, as shown in Figure 3.

2.4.1. Attribute Agreement Analysis Method Based on Kappa Values

(1) Confusion matrix
The confusion matrix is a valuable assessment tool used in machine learning. It provides a visual representation of the predicted and true class results, covering all possible cases of a classification problem. In binary classification, the confusion matrix has dimensions of 2 × 2. Various performance metrics can be derived from the confusion matrix, such as precision and recall, which are applicable to all classification algorithms [33]. TP represents correctly predicted positive cases (true positive cases), TN represents correctly predicted negative cases (true negative cases), FP represents falsely predicted positive cases (false positive cases), and FN represents falsely predicted negative cases (false negative cases) [34].
To analyze discrepancies in the labeling results among labeling workers, this study employs the confusion matrix in conjunction with the MSA methodology to quantify the labeling outcomes as data. Attribute agreement is assessed using Kappa values. In this study, pixels labeled as foreground in both labeling results are considered TP, pixels labeled as foreground in the first result and background in the second result are considered FN, pixels labeled as background in the first result and foreground in the second result are considered FP, and pixels labeled as background in both results are considered TN. The flowchart of the calculation method for the confusion matrix is presented in Figure 4.
The computation process for the confusion matrix is as follows:
(a) Load two pixel-wise labeled binary images (foreground = 1, background = 0);
(b) Iterate over each pixel and compare the two labeled values;
(c) If both are foreground (1,1), increment TP by 1;
(d) If both are background (0,0), increment TN by 1;
(e) If A is background (0) and B is foreground (1), increment FP by 1;
(f) If A is foreground (1) and B is background (0), increment FN by 1;
(g) Compute the final confusion matrix and calculate the Kappa value.
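For illustration, steps (a)–(g) can be vectorized with NumPy rather than looping pixel by pixel. The following minimal sketch assumes the two labeling results have been rasterized to same-shape binary masks (foreground = 1, background = 0); the function name is ours, not part of the program described in this paper.

```python
import numpy as np

def confusion_matrix(first: np.ndarray, second: np.ndarray):
    """TP, TN, FP, FN between two binary labeling results.

    `first` and `second` are same-shape masks with foreground = 1 and
    background = 0; the per-pixel loop of steps (b)-(f) is vectorized.
    """
    a = first.astype(bool)
    b = second.astype(bool)
    tp = int(np.sum(a & b))    # (c) foreground in both results
    tn = int(np.sum(~a & ~b))  # (d) background in both results
    fp = int(np.sum(~a & b))   # (e) background in first, foreground in second
    fn = int(np.sum(a & ~b))   # (f) foreground in first, background in second
    return tp, tn, fp, fn
```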
(2) Contour ring
During the image labeling process, an imbalance between the number of labeled foreground and background pixels may arise. This creates a problem of positive and negative sample imbalance, resulting in inflated Kappa values that fail to adequately differentiate disagreements in labeling results. In the three types of images used in this study, the number of labeled foreground pixels is significantly smaller than the number of background pixels, meaning that TP is much smaller than TN in the confusion matrices.
For this purpose, a contour ring was designed. The number of pixels in the contour rings was used to calculate the confusion matrix rather than all pixels in the labeled image.
The extraction method of the contour ring involves the following steps: merging (taking the union of) the results of multiple labeling attempts on the same image by the two labeling workers; extracting the edge of the foreground region from the merged region as the contour of the foreground region; and obtaining the contour ring through a dilation operation on the contour, as illustrated in Figure 5. The contour is extracted from the union of the labeling results because this contour region is where disagreement in labeling results is highest; most disagreements occur near the contour of the union, which is exactly where the contour ring is situated.
To ensure the accurate extraction of the contour where the foreground region is in contact with the image boundary, the image boundary pixels are expanded by dilating the background pixels surrounding the image before edge detection. Once the contour ring is obtained, an inverse operation is performed to remove the expanded image boundary and restore the image to its original size.
An adaptive parameterization framework was established through systematic analysis to optimize contour ring configuration, with operational guidelines formulated as follows:
The dilation parameters (convolutional kernel dimensions and iteration counts) were algorithmically modulated according to foreground pixel density (FPD), calculated as follows: FPD = (Foreground Pixels/Total Pixels) × 100.
For images with low-density morphology (FPD < 10%), such as tomato stem images (FPD 2.9–8.6%), the convolutional kernel is 5 × 5, and iterative dilation is 5. For images with high-density morphology (FPD ≥ 10%), such as group-reared pig images, the convolutional kernel is 3 × 3, and iterative dilation is 3.
This approach ensures a relative balance of the total pixel count in the contour ring when calculating Kappa values for images with varying foreground pixel ratios, thereby enhancing the sensitivity and accuracy of Kappa values in distinguishing labeling quality differences across different image types.
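The contour ring extraction can be sketched with OpenCV as follows. The kernel sizes and iteration counts follow the FPD rule stated above; the use of a morphological gradient as the edge detector, the one-pixel padding width, and the function name are our assumptions for illustration, since the paper does not name a specific edge detection operator.

```python
import cv2
import numpy as np

def contour_ring(masks: list) -> np.ndarray:
    """Build a contour ring from the union of several binary masks (0/1)."""
    union = np.clip(np.sum(masks, axis=0), 0, 1).astype(np.uint8)

    # Adaptive parameters based on foreground pixel density (FPD).
    fpd = 100.0 * union.sum() / union.size
    ksize, iters = ((5, 5), 5) if fpd < 10 else ((3, 3), 3)

    # Pad with background so contours touching the image border are detected,
    # then take the morphological gradient of the union as its edge.
    padded = cv2.copyMakeBorder(union, 1, 1, 1, 1, cv2.BORDER_CONSTANT, value=0)
    edge = cv2.morphologyEx(padded, cv2.MORPH_GRADIENT, np.ones((3, 3), np.uint8))

    # Dilate the edge into a ring, then crop back to the original image size.
    ring = cv2.dilate(edge, np.ones(ksize, np.uint8), iterations=iters)
    return ring[1:-1, 1:-1].astype(bool)
```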
Once the contour ring is obtained, the entire image is intersected with the contour ring, and only pixels within the contour ring, representing the interested result, are considered for computing the confusion matrix, as shown in Figure 4. To differentiate between the confusion matrices obtained from considering all pixels in the whole image and those obtained from considering only pixels within the contour ring, they are referred to as the “whole image confusion matrix” and the “contour ring confusion matrix”, respectively. The design of the contour ring addresses the issue of poor differentiation in agreement metrics caused by the imbalance of positive and negative samples in labeled images. This approach enhances the sensitivity of the agreement metrics to disagreements in labeled results.
(3) Kappa values
Kappa, initially proposed by Cohen [35], is widely utilized in assessing the accuracy of remote sensing classifications. Its purpose is to evaluate the agreement between two classified images. In the field of statistics, Kappa serves as a metric to measure the agreement of a classification model. It assesses the degree of agreement between the model’s predictions and the actual observations, taking into account the chance agreement of correct or incorrect predictions in the results. Kappa can also be used to measure the accuracy of classification results when two classifiers are compared [36]. The range of the Kappa values is from −1 to 1, with 1 indicating perfect agreement, 0 indicating random prediction, and −1 indicating complete disagreement.
The formula for calculating the Kappa value is shown in Equation (1).
$$\mathrm{Kappa} = \frac{P_o - P_e}{1 - P_e} = 1 - \frac{1 - P_o}{1 - P_e} \tag{1}$$
where Po denotes the observed agreement ratio; Pe denotes the expected agreement ratio; and the calculations of Po and Pe are shown in Equations (2) and (3), respectively.
$$P_o = \frac{TP + TN}{TP + FN + FP + TN} \tag{2}$$
$$P_e = \frac{(TP + FN)(TP + FP) + (FP + TN)(FN + TN)}{(TP + FN + FP + TN)^2} \tag{3}$$
where TP, TN, FP, and FN are given by the confusion matrix.
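Equations (1)–(3) translate directly into a few lines of code; a minimal sketch, assuming the counts come from a confusion matrix as computed above:

```python
def kappa(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's Kappa from a 2x2 confusion matrix, per Equations (1)-(3)."""
    n = tp + tn + fp + fn
    po = (tp + tn) / n                                             # Equation (2)
    pe = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2  # Equation (3)
    return (po - pe) / (1 - pe)                                    # Equation (1)
```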
(4) Quantitative evaluation criteria for image labeling quality
The evaluation standard of the Kappa value in attribute agreement analysis is as follows. If the Kappa is greater than or equal to 0.75, the attribute agreement of the image labeling results is considered good. In other words, the image labeling quality is deemed to be qualified [37].
It should be noted that the stringency of the qualified standard for image labeling quality is influenced by the size of the convolution kernel and the number of dilation iterations used when extracting the contour ring. When the convolution kernel and the number of dilation iterations increase, the total number of pixels in the contour ring also increases, resulting in a more lenient passing standard for image labeling quality. Conversely, when the convolution kernel and the number of dilation iterations are smaller, the total number of pixels decreases, leading to a stricter passing standard for image labeling quality.

2.4.2. Internal Image Labeling Analysis

The internal image labeling quality of each labeling worker was analyzed through internal agreement analysis. This analysis evaluates the agreement among multiple repeated labeling results produced by the same labeling worker for the same image using the same labeling tool and following the same labeling rules. According to the requirements of attribute agreement analysis, at least two repeated labeling instances are necessary; to reduce the number of repeated labels in practical applications, the number of repetitions was set to two. Internal agreement analysis enables the examination of the stability and reliability of each worker's labeling operation. For each worker, internal agreement analysis is performed separately by calculating the confusion matrix between their repeated labeling results. The process for internal agreement analysis among labeling workers is depicted in Figure 6.
For labeling workers A and B, the internal confusion matrices from two repeated labeling results of the same image, A1 and A2, and B1 and B2, are computed. The Kappa values are then determined and denoted as KA and KB, respectively. These values are used to evaluate the internal agreement of each labeling worker. A higher KA suggests a higher internal image labeling quality for worker A, while a lower KA indicates a lower quality. The same principle applies to worker B.

2.4.3. External Image Labeling Analysis

External agreement analysis was applied to assess the agreement among the labeling results obtained by different labeling workers for the same image using the same labeling tool and following the same labeling rules. This analysis is crucial for evaluating external image labeling quality. At least two labeling workers are essential for conducting attribute agreement analysis; therefore, two labeling workers were randomly selected for the analysis of external image labeling quality. The process for external agreement analysis among labeling workers is depicted in Figure 7.
Given that each worker provides two labeling results for each image, the external confusion matrix is computed from the intersections of each worker's two results. For workers A and B, A∩ and B∩ are the intersections A∩ = A1 ∩ A2 and B∩ = B1 ∩ B2, respectively. Subsequently, the external confusion matrix between A∩ and B∩ and the corresponding Kappa value, denoted by KAB, are determined. The Kappa value KAB serves as a measure of the external agreement between two labeling workers.
A higher KAB signifies a higher external image labeling quality between two workers, whereas a lower KAB suggests otherwise.

2.4.4. Overall Image Labeling Quality Analysis

The overall image labeling quality of all the labeling results is analyzed through overall agreement analysis. This analysis evaluates the overall agreement of the labeling results obtained by different labeling workers who label the same image multiple times using the same labeling tool and following the same labeling rules. Overall agreement analysis can be used to assess the stability and reliability of the labeling operation performed by all labeling workers across all labeling sessions. The process is illustrated in Figure 8.
To compute the overall confusion matrix, the four labeling results (A1, A2, B1, B2) produced by the two workers for each image are intersected and unioned, represented as A ∩ B = A1 ∩ A2 ∩ B1 ∩ B2 and A ∪ B = A1 ∪ A2 ∪ B1 ∪ B2. The overall confusion matrix between the intersection and the union (A ∩ B and A ∪ B) and the corresponding Kappa value, denoted by KA,B, can then be calculated. The Kappa value KA,B is used to evaluate the overall agreement of all labeling results.
A higher overall agreement of all labeling results indicates a higher overall image labeling quality by two labeling workers. This is reflected by a larger KA,B value, and vice versa.
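Putting Sections 2.4.2–2.4.4 together, the three kinds of Kappa values for one image can be sketched as below, reusing the confusion_matrix and kappa helpers from the earlier sketches; passing the contour ring mask restricts the computation to the contour ring confusion matrix. The function name and signature are illustrative, not the paper's program.

```python
def agreement_kappas(A1, A2, B1, B2, ring=None):
    """KA, KB (internal), KAB (external), KA_B (overall) for one image.

    A1, A2, B1, B2: boolean masks of the four labeling results.
    ring: optional boolean contour ring mask; if given, only pixels
    inside the ring enter the confusion matrices (Section 2.4.1).
    """
    def k(x, y):
        if ring is not None:
            x, y = x[ring], y[ring]
        return kappa(*confusion_matrix(x, y))

    KA = k(A1, A2)                                  # internal agreement, worker A
    KB = k(B1, B2)                                  # internal agreement, worker B
    KAB = k(A1 & A2, B1 & B2)                       # external: A∩ vs. B∩
    KA_B = k(A1 & A2 & B1 & B2, A1 | A2 | B1 | B2)  # overall: A ∩ B vs. A ∪ B
    return KA, KB, KAB, KA_B
```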

2.4.5. Quantitative Analysis Method for Labeling Bias

The method is the same as the internal image labeling analysis method presented in Section 2.4.2. The only difference is that labeling bias analysis evaluates the agreement between labeling results and labeling standards, rather than between the two repeated labeling results of one labeling worker.
By analyzing the Kappa values between the labeling results of each worker and standard labels, a higher Kappa value indicates a higher agreement between workers’ labeling results and the standards, reflecting a higher image labeling quality.

3. Results and Discussion

3.1. Image Labeling Quality Analysis of Labeling Variation

Since Kappa values based on the whole image confusion matrix and Kappa values based on the contour ring confusion matrix exhibit similar trends, this section focuses on analyzing Kappa values based on the contour ring confusion matrix. The findings and conclusions obtained through this analysis can also be applied to Kappa values based on the whole image confusion matrix.
The Kappa values of the labeling results of 10 tomato stem images and 10 group-reared pig images are shown in Table 1.
(1)
Internal image labeling quality analysis (KA and KB)
In Table 1, worker A demonstrates the lowest labeling quality for the No. 3 image among the 10 tomato stem images. Upon scrutiny of the No. 3 image and the two labeling results provided by worker A, as depicted in Figure 9, it becomes evident that the labeling results for four lateral branches are inconsistent across the two repeated labeling results. This illustrates the potential for missing labels on certain lateral branches of tomato plants, which results in the diminished labeling quality observed in the No. 3 image labeled by worker A. Conversely, worker A's labeling quality for the No. 1 image is the highest among the 10 images. Upon closer inspection of the No. 1 image and the two repeated labeling results by worker A, it is evident that there are fewer side branches in the No. 1 image, and worker A consistently produced similar labeling results for the two side branches across the repeated labeling results. This suggests that the issue of missing labels is less likely to occur for tomato plants with fewer lateral branches.
Worker B, however, shows the lowest labeling quality for the No. 1 image among the 10 tomato stem images. An examination of the No. 1 image and the two labeling results provided by worker B shows that worker B initially labeled the occluded branch but failed to label it during the second labeling instance. This highlights the fact that occluded branches are prone to being omitted during the labeling process. On the other hand, worker B achieved the highest labeling quality for the No. 2 image. Upon reviewing the No. 2 image and the two labeling results, it becomes apparent that there is minimal occlusion of the tomato stems in the No. 2 image, and worker B consistently produced matching labeling results across repeated labeling instances. This suggests that high-quality labeling results are relatively easy to obtain for tomato images with less stem occlusion.
For worker A, there are 4 images with Kappa values KA above 0.75 among 10 tomato stem images, resulting in a 40% qualified rate for worker A’s tomato stem image labeling. Similarly, for worker B, there are 5 images with Kappa values KB above 0.75, indicating a qualified rate of 50% for tomato stem images.
In Table 1, it is evident that worker A has the lowest labeling quality for the No. 15 image among 10 group-reared pig images. According to the No. 15 image and the two labeling results by worker A, as shown in Figure 9, there are disagreements in the labeling results of the head and foot parts of pigs, along with a black stain on the foot resembling the color of the pigpen ground. These findings indicate that certain pigs are prone to inconsistent labeling of the head and foot parts, particularly in images where pigs are in a curled-up position or where stains are present. Conversely, worker A exhibits the highest quality of labeling for the No. 11 image. After carefully observing the No. 11 image and the two repeated labeling results obtained by worker A, it can be concluded that the pig is in a lying-down posture, and the labeling results obtained by worker A remain fairly consistent. This suggests that when the pig assumes a lying-down posture, the labeling agreement is high due to the clear display of body parts. The Kappa values for each image labeled by worker B differ by no more than 0.045, indicating that the labeling quality remains stable.
Worker A achieves Kappa values KA above 0.75 in 7 images, resulting in a 70% qualified rate for worker A’s group-reared pig image labeling. Worker B, on the other hand, obtains Kappa values above 0.75 in all 10 images, indicating a qualified rate of 100% for group-reared pig images.
To systematically evaluate the impact of image category variation on experimental validation, we implemented a methodological approach focusing on bird images (the third category) for labeling quality assessment, with quantitative outcomes detailed in Table 2. The qualification rates for workers A and B demonstrate distinct patterns across image categories. For bird images, both workers attained maximum qualification rates (100%). Comparative analysis reveals that bird images exhibited the highest labeling quality, indicating the lowest labeling difficulty for this category. Conversely, tomato stem images showed the poorest labeling performance across metrics, with corresponding data confirming them as the most challenging category for accurate annotation. This inverse relationship between labeling quality and task difficulty highlights significant variation in annotation complexity across different image types.
(2) External image labeling quality analysis (KAB)
In Table 1, for tomato stem images, the external image labeling quality of the No. 8 image, labeled by workers A and B, is the poorest. According to the four labeling results provided by workers A and B in Figure 9, worker B labeled two main stems outside the center of the field of view, which worker A failed to do. It indicates a disparity in the understanding of labeling rules between workers A and B. These two workers exhibit the highest external image labeling quality for the No. 10 image. Upon scrutinizing the four labeling results in Figure 9, it is evident that there is only one tomato plant with a clear stem, and the labeling results obtained by the two workers align closely. Consequently, it can be inferred that the image labeling quality is superior for tomato stem images where there is only one distinct main stem.
Regarding tomato stem images, 4 images possess KAB values above 0.75 among 10 images, resulting in a qualified rate of 40% for external labeling quality.
In Table 1, for group-reared pig images, the discrepancy among Kappa value KAB lies within 0.067, indicating stable external image labeling quality. Ten images have KAB values above 0.75, achieving a qualified rate of 100% for external labeling quality.
To systematically investigate potential disparities in labeling expertise, we implemented a three-worker framework incorporating worker C, an expert-level reference constructed from the validated bird image labels of the IEEE ICIP 2016 dataset [28]. These peer-reviewed labels, widely recognized as benchmark references in semantic segmentation research, served as our methodological ground truth. The experimental cohort comprised the following: (1) worker C (expert-level proficiency), (2) worker B (intermediate proficiency), and (3) worker A (baseline annotation capability), collectively representing a stratified expertise continuum.
Quantitative analysis (Table 2) employed Kappa coefficient calculations through dual analytical approaches: whole-image evaluation and contour-ring-specific assessment. External labeling quality comparisons (KAB, KAC, KBC) revealed non-significant differentiation across expertise levels via two-sample t-testing (whole-image p-values: 0.920, 0.919, 0.847; contour-ring p-values: 0.730, 0.289, 0.438; α = 0.05). This statistical uniformity suggests comparable external labeling quality between workers, potentially attributable to the fundamental ease of bird labeling tasks, enabling high consensus across proficiency levels.
(3) Overall image labeling quality analysis (KA,B)
For tomato stem images, Table 1 presents an assessment of the overall image labeling quality achieved by workers A and B, with the No. 3 image yielding the poorest labeling quality. A careful examination of the four labeling iterations from workers A and B in Figure 9 reveals inconsistencies in the labeling of two side branches by worker A, as well as inconsistencies in the labeling of six side branches by worker B. Additionally, worker A labeled one stem, while worker B labeled three stems. These findings suggest that the problem of labeling omission is more likely to occur for certain lateral branches of the tomato plant. Moreover, for main stems located off-center in the field of view, the existing labeling rules tend to produce different interpretations among workers. Consequently, the overall image labeling quality for the No. 3 image is considerably poor. On the other hand, the highest overall image labeling quality is achieved for the No. 10 image. The No. 10 image features only one lateral branch, and the labeling results of workers A and B coincide. Thus, no labeling discrepancies arise in images with only one lateral branch on the tomato plant.
None of the images exhibit KA,B values above 0.75, indicating that after inconsistencies are identified between the internal and external labeling results, the overall image labeling quality for tomato stem images is deemed unsatisfactory.
Similarly, for group-reared pig images, Table 1 shows that the No. 16 image ranks as the image with the poorest overall image labeling quality. An examination of the four labeling iterations in Figure 9 reveals that there is a discrepancy in labeling between worker A and worker B for pigs situated at the edge of the pigpen with few pixels. This discrepancy emerges from worker B labeling the pigs, while worker A fails to do so. Thus, the omission problem tends to occur for pigs appearing at the image’s periphery. Conversely, workers A and B exhibit the highest overall image labeling quality for the No. 11 image. Examining the No. 11 image and the four labeling results by workers A and B, the pig in the No. 11 image is in a lying posture, and the labeling results are virtually identical. This finding suggests that when a pig is in a lying posture, the overall image labeling quality is good.
None of the images attained KA,B values above 0.75, indicating that the overall image labeling quality fails to meet the requisite standards for group-reared pig images after inconsistencies are identified between the internal and external labeling results.
To further validate methodological robustness, we conducted supplementary experiments utilizing 20 bird images. Statistical comparisons between Kappa coefficients derived from the first 10-image set and the 20-image cohort were performed using two-sample independent t-tests (Table 2). The analysis revealed p-values consistently surpassing the standard significance threshold (α = 0.05), demonstrating statistically equivalent metric distributions regardless of sample size variation. This empirical evidence confirms that sample size variations do not compromise the validity of our conclusions within the tested numerical range.

3.2. Image Labeling Quality Analysis of Labeling Bias

Labeling bias, i.e., the agreement between labeling results and labeling standards, captures the systematic deviation that recurs in every labeling session and was analyzed as follows. Although no universal standards exist in image labeling in which every pixel is correctly classified as foreground or background, high-quality image labeling results can be regarded as labeling standards. Therefore, to analyze the agreement between the labeling results of the two workers and the labeling standards, bird image labels were treated as labeling standards; these were accessed from the publicly available dataset released at the 2016 IEEE International Conference on Image Processing (ICIP) [28]. The labeling results of workers A and B were compared to the labeling standards to analyze labeling bias. The results are shown in Table 2.
In Table 2, the Kappa values of worker B are higher than those of worker A for 14 of 20 images, indicating that worker B has better image labeling agreement with the standards. This means that worker B’s image labeling quality is relatively higher. This finding is consistent with the conclusions from the internal image labeling quality analysis.

3.3. Comparative Analysis Between Whole Image Confusion Matrix and Contour Ring Confusion Matrix

Comparing Kappa values based on the whole image confusion matrix and contour ring confusion matrix presented in Table 1, it can be observed that for the tomato stem images, the Kappa values obtained from the whole image confusion matrix are in the range of [0.652, 0.983] with an interval width of 0.331. On the other hand, the Kappa values obtained from the contour ring confusion matrix are in the range of [0.320, 0.941], with an interval width of 0.621. For the group-reared pig images, the Kappa values obtained from the whole image confusion matrix fall within the range of [0.967, 0.994], with an interval width of 0.027. Conversely, the Kappa values obtained from the contour ring confusion matrix lie within the range of [0.572, 0.857], with an interval width of 0.285. These results indicate that Kappa values computed based on contour ring confusion matrices have a better discriminatory ability and higher sensitivity to inconsistencies in labeling results.
We also quantitatively assessed the ranges of the ten Kappa values in each column of Table 1. A two-sample independent t-test was conducted on the two sets of ranges derived from the two evaluation frameworks: whole-image analysis and contour-ring-based assessment. The test revealed a statistically significant difference between the ranges of the two frameworks (p = 0.01), rejecting the null hypothesis of distributional equivalence. This demonstrates that the dilation-based methodology significantly enhances the measurement precision of annotation consistency, confirming its value in image labeling quality assessment.
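As an illustration of this comparison, assuming the per-column Kappa values of Table 1 are available as lists, the range test might look like the following sketch (using scipy's ttest_ind; the grouping of ranges is our reading of the procedure, not code from the paper):

```python
from scipy import stats

def compare_frameworks(whole_image_cols, contour_ring_cols):
    """Two-sample t-test on the per-column ranges of Kappa values.

    Each argument is a list of columns from Table 1, where a column is
    the ten Kappa values of one metric (KA, KB, KAB, or KA,B).
    """
    whole_ranges = [max(col) - min(col) for col in whole_image_cols]
    ring_ranges = [max(col) - min(col) for col in contour_ring_cols]
    t_stat, p_value = stats.ttest_ind(whole_ranges, ring_ranges)
    return t_stat, p_value
```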
It is important to note that the differentiation of Kappa values calculated based on the contour ring confusion matrix can be adjusted by altering the number of dilation iterations applied to the contours. Increasing the number of iterations reduces differentiation. For tomato stem images, relatively more dilation iterations can be used, whereas for group-reared pig images, relatively fewer dilation iterations are recommended.
In biological images, different parts have varying pixel numbers, such as the abdomen and tail of pigs or the roots of plants. The accuracy of edge detection results is indeed affected by the varying sizes and shapes of the foreground areas in images, especially the complexity of the contour edges. However, differences on contour edges can be minimized through the dilation operations used to obtain contour rings. Two sets of images with subtle differences in labeled edges were selected to test the effect caused by differences on contour edges. The contour rings produced from these two image sets are denoted Contour Ring I and Contour Ring II. The Kappa values based on the confusion matrices for these two contour rings were calculated, and the results are shown in Table 3.
Table 3 shows that differences in contour edges have an insignificant impact on Kappa value calculation after dilation operations. The reason is that the pixels in the contour rings were obtained through dilation of the contour edge detection results, and the Kappa values were calculated from the total pixels in the contour rings. As a result, the accuracy requirements for edge detection results are not excessively stringent.

3.4. Comparative Analysis of the Image Labeling Quality for Images with Different Labeling Difficulty

Upon comparing Kappa values of tomato stem images and group-reared pig images from Table 1, the following intervals are observed for Kappa values obtained from the whole image confusion matrix: KA (tomato stem image) [0.817, 0.945], KB (tomato stem image) [0.852, 0.983], KAB (tomato stem image) [0.799, 0.945], KA,B (tomato stem image) [0.652, 0.899], KA (group-reared pig image) [0.986, 0.991], KB (group-reared pig image) [0.991, 0.994], KAB (group-reared pig image) [0.979, 0.992], and KA,B (group-reared pig image) [0.967, 0.984].
For Kappa values obtained from the contour ring confusion matrix of tomato stem images, KA is in the interval [0.642, 0.841], KB is in the interval [0.639, 0.910], KAB is in the interval [0.575, 0.865], and KA,B is in the interval [0.372, 0.725]. For group-reared pig images, KA is in the interval [0.698, 0.805], KB is in the interval [0.808, 0.857], KAB is in the interval [0.752, 0.819], and KA,B is in the interval [0.560, 0.662].
It is worth noting that the Kappa values of group-reared pig images are greater than those of tomato stem images even though the contours of group-reared pigs are dilated using a smaller convolution kernel and fewer dilation iterations during contour ring construction, i.e., under a stricter evaluation criterion.
The labeling quality of tomato stem images is lower than that of group-reared pig images, which is supported by the fact that labeling group-reared pig images is easier. It further confirms the effectiveness of the quantitative analysis method proposed in this study for evaluating image labeling quality based on attribute agreement analysis. As an ancillary conclusion, Kappa values can also be used as a metric of image labeling difficulty.
The core innovation of this study lies in (1) establishing a comprehensive framework for assessing image labeling quality through three dimensions: internal, external, and overall labeling quality consistency, and (2) introducing a contour ring mechanism to effectively address the diminished discriminative power of evaluation metrics stemming from a foreground–background pixel imbalance. Within this methodological framework, we have systematically designed a measurement protocol incorporating confusion matrix computation and standardized testing procedures.
While specific metric selection does not constitute the primary focus of this research nor impact its fundamental conclusions, it should be noted that alternative metrics such as IoU (Intersection over Union) remain methodologically compatible. Notably, our Kappa coefficient implementation operates at the pixel-level annotation scale, with its methodological superiority residing in explicitly accounting for random chance factors through its probabilistic interpretation framework. To empirically validate metric interchangeability, we conducted comparative analyses using tomato stem imagery datasets. The observed Pearson correlation coefficient of 0.998 between Kappa and IoU values demonstrates near-perfect agreement. This statistical evidence confirms that both metrics can serve as interchangeable agreement measures within our evaluation system without compromising analytical validity.
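The reported Kappa–IoU agreement can be checked in a few lines; a sketch, assuming per-image confusion matrices are available and reusing the kappa helper from Section 2.4.1 (the foreground IoU formula is the standard definition, not taken from the paper):

```python
from scipy import stats

def foreground_iou(tp: int, tn: int, fp: int, fn: int) -> float:
    """Intersection over Union of the foreground class; tn is unused."""
    return tp / (tp + fp + fn)

def kappa_iou_correlation(confusions):
    """Pearson correlation between per-image Kappa and IoU values."""
    kappas = [kappa(*c) for c in confusions]
    ious = [foreground_iou(*c) for c in confusions]
    r, _p = stats.pearsonr(kappas, ious)
    return r
```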
It should be mentioned that this study primarily focused on the agreement analysis of image labeling results. Image labeling quality can also be analyzed by examining the total fluctuation of labeling workers' repeated labeling results over a period. Such dynamic analysis of image labeling variations over time is suitable for studying the general temporal labeling behavior of labeling workers, analyzing the labeling habits of specific labeling workers, and identifying factors influencing labeling behaviors. Automated quality control is also a promising direction: integrating automated checks (e.g., real-time Kappa monitoring) could streamline quality assessment in practice. Such analysis can help identify specific areas for image labeling quality control and improvement. We will explore these directions in subsequent research, particularly for large-scale labeling workflows.
In practical applications, human image labeling involves a typical cost–quality trade-off. If the labeling is excessively precise, it consumes substantial time and human resources; however, minor labeling errors are acceptable as long as they do not affect the final recognition performance. In some cases, a moderate level of labeling error may even enhance model robustness, making the model more adaptable in real-world applications. Therefore, when evaluating labeling quality, labeling consistency should not be the sole criterion. Instead, the evaluation should consider the practical application context and the model's sensitivity to different labeling errors, thereby balancing labeling cost and labeling quality effectively. In other words, different levels of labeling quality, such as precise labeling, labeling with minor errors, and labeling with a moderate level of errors, can be identified by the method proposed in this study, and labelers at appropriate quality levels can then be selected according to the practical application context and the model's sensitivity to different labeling errors.
This study mainly focuses on semantic segmentation of biological images. The method presented in this paper can also be used for labeling quality analysis in other types of image segmentation, which we will incorporate for biological images in further research.

4. Conclusions

Image labeling quality analysis is crucial to improve the quality of deep learning datasets. The attribute agreement analysis method provides a new solution for image labeling quality analysis.
The attribute agreement analysis method enables the quantitative evaluation of labeling variation and labeling bias. The analysis of labeling variation covers the internal, external, and overall image labeling quality of semantically segmented biological images. It allows for the differentiation of image labeling quality among different labeling results and labeling standards, and provides a quantitative assessment of the overall image labeling quality. Factors such as foreground singularity, occlusion, and stains in the image significantly impact labeling quality.
The use of contour rings enhances the differentiation of Kappa values for different image labeling qualities in semantically segmented biological images. This approach addresses the low sensitivity of agreement metrics caused by the imbalance of positive and negative samples during labeling. Setting up the contour ring involves choosing the convolution kernel size and the number of dilation iterations, which affect the stringency of the evaluation criteria for image labeling quality and the differentiation of Kappa values. Smaller convolution kernels and fewer dilations result in stricter evaluation criteria and greater differentiation of Kappa values.
The attribute agreement analysis method allows for quantitative analysis of labeling quality based on image labeling difficulty. Images with greater labeling difficulty, such as tomato stem images, yield lower Kappa values in labeling results, i.e., poorer image labeling quality. Images with lower labeling difficulty, such as group-reared pig images, yield higher Kappa values in labeling results, i.e., higher image labeling quality. As an ancillary conclusion, Kappa values can also be used as a metric of image labeling difficulty.
The proposed method can also be applied in labeling quality analysis of other types of images for semantic segmentation, and dynamic analysis of image labeling variations over time needs further research.

Author Contributions

Conceptualization, R.X. and X.Y.; methodology, R.X.; software, X.Y.; validation, R.X. and X.Y.; formal analysis, X.Y.; investigation, R.X., X.Y. and Y.Z.; resources, R.X., X.Y., Y.Z. and X.Z.; data curation, X.Y.; writing—original draft preparation, X.Y.; writing—review and editing, R.X. and X.Y.; visualization, R.X. and X.Y.; supervision, R.X.; project administration, R.X.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R&D Program of Zhejiang Province, grant number 2022C02024.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

We appreciate Zhejiang Academy of Agricultural Sciences for providing a test site for group-reared pigs’ image acquisition.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, Y.; Yu, X.; Yang, Y.; Zhang, X.; Zhang, Y.; Zhang, L.; Feng, R.; Xue, J. A multi-branched semantic segmentation network based on twisted information sharing pattern for medical images. Comput. Methods Programs Biomed. 2024, 243, 107914.
2. Ulusoy, U.; Eren, O.; Demirhan, A. Development of an obstacle avoiding autonomous vehicle by using stereo depth estimation and artificial intelligence based semantic segmentation. Eng. Appl. Artif. Intell. 2023, 126, 106808.
3. Wang, F.; Luo, X.; Wang, Q.; Li, L. Aerial-BiSeNet: A real-time semantic segmentation network for high resolution aerial imagery. Chin. J. Aeronaut. 2021, 34, 47–59.
4. Puthumanaillam, G.; Verma, U. Texture based prototypical network for few-shot semantic segmentation of forest cover: Generalizing for different geographical regions. Neurocomputing 2023, 538, 126201.
5. Luo, Z.; Yang, W.; Yuan, Y.; Gou, R.; Li, X. Semantic segmentation of agricultural images: A survey. Inf. Process. Agric. 2024, 11, 172–186.
6. Liu, Y.; Wang, B.; Sheng, Q.; Li, J.; Zhao, H.; Wang, S.; Liu, X.; He, H. Dual-polarization SAR rice growth model: A modeling approach for monitoring plant height by combining crop growth patterns with spatiotemporal SAR data. Comput. Electron. Agric. 2023, 215, 108358.
7. Polder, G.; Dieleman, J.A.; Hageraats, S.; Meinen, E. Imaging spectroscopy for monitoring the crop status of tomato plants. Comput. Electron. Agric. 2024, 216, 108504.
8. Cao, Y.; Chen, L.; Yuan, Y.; Sun, G. Cucumber disease recognition with small samples using image-text-label-based multi-modal language model. Comput. Electron. Agric. 2023, 211, 107993.
9. Talukder, M.S.H.; Bin Sulaiman, R.; Chowdhury, M.R.; Nipun, M.S.; Islam, T. PotatoPestNet: A CTInceptionV3-RS-based neural network for accurate identification of potato pests. Smart Agric. Technol. 2023, 5, 100297.
10. Su, J.; Anderson, S.; Javed, M.; Khompatraporn, C.; Udomsakdigool, A.; Mihaylova, L. Plant leaf deep semantic segmentation and a novel benchmark dataset for morning glory plant harvesting. Neurocomputing 2023, 555, 126609.
11. Ghimire, D.; Lee, K.; Kim, S. Loss-aware automatic selection of structured pruning criteria for deep neural network acceleration. Image Vis. Comput. 2023, 136, 104745.
12. Dairath, M.H.; Akram, M.W.; Mehmood, M.A.; Sarwar, H.U.; Akram, M.Z.; Omar, M.M.; Faheem, M. Computer vision-based prototype robotic picking cum grading system for fruits. Smart Agric. Technol. 2023, 4, 100210.
13. Hao, H.; Jincheng, Y.; Ling, Y.; Gengyuan, C.; Sumin, Z.; Huan, Z. An improved PointNet++ point cloud segmentation model applied to automatic measurement method of pig body size. Comput. Electron. Agric. 2023, 205, 107560.
14. Han, J.; Siegford, J.; Colbry, D.; Lesiyon, R.; Bosgraaf, A.; Chen, C.; Norton, T.; Steibel, J.P. Evaluation of computer vision for detecting agonistic behavior of pigs in a single-space feeding stall through blocked cross-validation strategies. Comput. Electron. Agric. 2023, 204, 107520.
15. Riaboff, L.; Shalloo, L.; Smeaton, A.F.; Couvreur, S.; Madouasse, A.; Keane, M.T. Predicting livestock behaviour using accelerometers: A systematic review of processing techniques for ruminant behaviour prediction from raw accelerometer data. Comput. Electron. Agric. 2022, 192, 106610.
16. Wang, H.; Shen, W.; Zhang, Y.; Gao, M.; Zhang, Q.; Xiaohui, A.; Du, H.; Qiu, B. Diagnosis of dairy cow diseases by knowledge-driven deep learning based on the text reports of illness state. Comput. Electron. Agric. 2023, 205, 107564.
17. Xu, B.; Wang, W.; Falzon, G.; Kwan, P.; Guo, L.; Chen, G.; Tait, A.; Schneider, D. Automated cattle counting using Mask R-CNN in quadcopter vision system. Comput. Electron. Agric. 2020, 171, 105300.
18. Li, J.; Chen, D.; Qi, X.; Li, Z.; Huang, Y.; Morris, D.; Tan, X. Label-efficient learning in agriculture: A comprehensive review. Comput. Electron. Agric. 2023, 215, 108412.
19. Xu, Z.; Lu, D.; Luo, J.; Wang, Y.; Yan, J.; Ma, K.; Zheng, Y.; Tong, R.K.Y. Anti-interference from noisy labels: Mean-teacher-assisted confident learning for medical image segmentation. IEEE Trans. Med. Imaging 2022, 41, 3062–3073.
20. Zhang, J.; Sheng, V.S.; Li, T.; Wu, X. Improving crowdsourced label quality using noise correction. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 1675–1688.
21. Tao, F.; Jiang, L.; Li, C. Label similarity-based weighted soft majority voting and pairing for crowdsourcing. Knowl. Inf. Syst. 2020, 62, 2521–2538.
22. Pičuljan, N.; Car, Ž. Machine learning-based label quality assurance for object detection projects in requirements engineering. Appl. Sci. 2023, 13, 6234.
23. Wang, W.; Zhou, Z.H. Crowdsourcing label quality: A theoretical analysis. Sci. China Inf. Sci. 2015, 58, 1–12.
24. Wang, W.; Guo, X.Y.; Li, S.Y.; Jiang, Y.; Zhou, Z.H. Obtaining high-quality label by distinguishing between easy and hard items in crowdsourcing. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
25. Al-Refaie, A.; Bata, N. Evaluating measurement and process capabilities by GR&R with four quality measures. Measurement 2010, 43, 842–851.
26. Gao, W. Measurement System Analysis, 1st ed.; China Standards Press: Beijing, China, 2004.
27. Montgomery, D.C. Introduction to Statistical Quality Control, 4th ed.; Wiley: Hoboken, NJ, USA, 2000.
28. Mansilla, L.A.C.; Miranda, P.A.V.; Cappabianco, F.A.M. Oriented image foresting transform segmentation with connectivity constraints. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016.
29. Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 2008, 77, 157–173.
30. Math, R.M.; Dharwadkar, N.V. Deep learning and computer vision for leaf miner infestation severity detection on muskmelon (Cucumis melo) leaves. Comput. Electr. Eng. 2023, 110, 108843.
31. Ferreira, I.S.B.; Peruchi, R.S.; Fernandes, N.J.; Rotella Junior, P. Measurement system analysis in angle of repose of fertilizers with distinct granulometries. Measurement 2021, 170, 108681.
32. Araújo, L.M.M.; Paiva, R.G.N.; Peruchi, R.S.; Junior, P.R.; Gomes, J.H.D.F. New indicators for measurement error detection in GR&R studies. Measurement 2019, 140, 557–564.
33. Xu, J.; Zhang, Y.; Miao, D. Three-way confusion matrix for classification: A measure driven view. Inf. Sci. 2020, 507, 772–794.
34. Wang, Y.; Jia, Y.; Tian, Y.; Xiao, J. Deep reinforcement learning with the confusion-matrix-based dynamic reward function for customer credit scoring. Expert Syst. Appl. 2022, 200, 117013.
35. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
36. Valero-Carreras, D.; Alcaraz, J.; Landete, M. Comparing two SVM models through different metrics based on the confusion matrix. Comput. Oper. Res. 2023, 152, 106131.
37. Xiang, R.; Chen, Y.; Shen, J.W.; Hu, S. Method for assessing the quality of data used in evaluating the performance of recognition algorithms for fruits and vegetables. Biosyst. Eng. 2017, 156, 27–37.
Figure 1. Image samples. Note: the non-English terms in the group-reared pig images are Chinese characters indicating the image capture dates.
Figure 2. Composition of the attribute measurement system. Note: the non-English terms in the group-reared pig images are Chinese characters indicating the image capture dates.
Figure 3. Flowchart of the quantitative analysis method for image labeling variation. Note: the non-English terms in the group-reared pig images are Chinese characters indicating the image capture dates.
Figure 4. Flowchart of the confusion matrix computation method: (a–d) are FP, TP, FN, and TN in the whole image, respectively, and (e–h) are FP, TP, FN, and TN in the contour ring, respectively.
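The counting step behind Figure 4 can be stated compactly: each pixel of a test labeling is compared with the corresponding pixel of a reference labeling and tallied as TP, FP, FN, or TN, either over the whole image or only within the contour ring. The following is a minimal sketch of that step, not the authors' implementation; it assumes boolean NumPy masks, and the `region` argument is a name introduced here for illustration to restrict counting to a contour ring.

```python
import numpy as np

def confusion_matrix(reference, test, region=None):
    """Pixel-wise TP/FP/FN/TN counts between two binary label masks.

    reference, test: 0/1 or boolean arrays of identical shape;
    region: optional boolean mask (e.g., a contour ring) limiting
    the count to pixels where region is True.
    """
    ref = np.asarray(reference, dtype=bool)
    tst = np.asarray(test, dtype=bool)
    if region is not None:
        keep = np.asarray(region, dtype=bool)
        ref, tst = ref[keep], tst[keep]
    tp = int(np.sum(ref & tst))    # object pixels in both labelings
    fp = int(np.sum(~ref & tst))   # object only in the test labeling
    fn = int(np.sum(ref & ~tst))   # object only in the reference labeling
    tn = int(np.sum(~ref & ~tst))  # background pixels in both labelings
    return tp, fp, fn, tn
```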
Figure 5. Contour ring extraction process: (a) labeled image; (b) extended image; (c) stem edge image; (d) partial enlargement of the stem edge; (e) edge dilation result; (f) contour ring.
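The pipeline in Figure 5 (extract the object edge, then dilate it into a ring) maps naturally onto standard morphological operations. Below is one plausible implementation, assuming OpenCV is available, `mask` is a binary uint8 labeling of the object, and `half_width` is a hypothetical parameter for the ring thickness; the paper's exact edge detector and dilation settings may differ.

```python
import cv2
import numpy as np

def contour_ring(mask, half_width=5):
    """Build a ring around the labeled object's contour by edge dilation."""
    # One-pixel edge of the labeled region; a morphological gradient is
    # a simple choice for binary masks (the paper's detector may differ).
    edge = cv2.morphologyEx(mask, cv2.MORPH_GRADIENT,
                            np.ones((3, 3), np.uint8))
    # Thicken the edge into a ring roughly 2 * half_width + 1 pixels wide.
    kernel = cv2.getStructuringElement(
        cv2.MORPH_ELLIPSE, (2 * half_width + 1, 2 * half_width + 1))
    ring = cv2.dilate(edge, kernel)
    return ring > 0  # boolean ring mask, usable as `region` above
```

Passing the resulting mask as the `region` argument of the confusion-matrix sketch above corresponds to the contour-ring columns of Tables 1–3.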
Figure 6. Flowchart of internal agreement analysis (taking a tomato stem image as an example).
Figure 7. Flowchart of external agreement analysis (taking a tomato stem image as an example).
Figure 8. Flowchart of overall agreement analysis (taking a tomato stem image as an example).
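Figures 6–8 share one computational core: a confusion matrix between two labeling results is converted into Cohen's Kappa [35]. A sketch of that conversion is given below; the pairings in the comments (which labelings feed KA, KB, KAB, and KA,B) are assumptions for illustration and should be read against the flowcharts.

```python
def kappa(tp, fp, fn, tn):
    """Cohen's Kappa [35] for a binary (object/background) confusion matrix."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                        # observed agreement
    p_e = ((tp + fp) * (tp + fn)
           + (fn + tn) * (fp + tn)) / n ** 2   # agreement expected by chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Assumed pairings (illustrative only): internal agreement compares one
# worker's two labelings of the same image; external agreement compares
# results across workers.
# k_a  = kappa(*confusion_matrix(a_first, a_second))  # internal, worker A
# k_ab = kappa(*confusion_matrix(a_first, b_first))   # external, A vs. B
```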
Figure 9. Examples of image labeling results: from left to right are the first and second labeling results of workers A and B, respectively. Different colors and shapes represent different regions of the image. Note: in the blue dotted rectangular box for No. 15, the bottom subfigures are partial enlargements of the white rectangular boxes in the top subfigures.
Table 1. Kappa values of image labeling results.

            Kappa Values Based on Whole            Kappa Values Based on Contour
            Image Confusion Matrix                 Ring Confusion Matrix
Image No.   KA      KB      KAB     KA,B           KA      KB      KAB     KA,B
Tomato stem images
No. 1       0.945   0.872   0.925   0.827          0.841   0.639   0.795   0.529
No. 2       0.922   0.983   0.936   0.888          0.741   0.941   0.790   0.635
No. 3       0.817   0.853   0.804   0.655          0.642   0.672   0.655   0.372
No. 4       0.913   0.852   0.888   0.772          0.795   0.667   0.755   0.507
No. 5       0.920   0.890   0.823   0.737          0.797   0.716   0.592   0.413
No. 6       0.890   0.917   0.907   0.794          0.717   0.775   0.749   0.490
No. 7       0.886   0.921   0.859   0.767          0.694   0.786   0.627   0.428
No. 8       0.881   0.864   0.799   0.652          0.725   0.682   0.575   0.320
No. 9       0.884   0.896   0.849   0.731          0.745   0.764   0.679   0.451
No. 10      0.928   0.970   0.945   0.899          0.809   0.910   0.865   0.725
Range       0.128   0.131   0.146   0.247          0.199   0.302   0.290   0.405
Group-reared pig images
No. 11      0.987   0.992   0.987   0.980          0.805   0.834   0.805   0.662
No. 12      0.991   0.993   0.992   0.984          0.776   0.840   0.819   0.631
No. 13      0.990   0.991   0.986   0.976          0.801   0.812   0.772   0.598
No. 14      0.986   0.993   0.989   0.979          0.749   0.853   0.805   0.614
No. 15      0.988   0.994   0.991   0.983          0.698   0.852   0.752   0.572
No. 16      0.989   0.993   0.979   0.967          0.772   0.857   0.752   0.560
No. 17      0.991   0.993   0.991   0.983          0.764   0.826   0.786   0.606
No. 18      0.990   0.993   0.990   0.982          0.777   0.840   0.785   0.614
No. 19      0.990   0.993   0.992   0.983          0.764   0.841   0.808   0.618
No. 20      0.989   0.991   0.991   0.980          0.749   0.808   0.803   0.579
Range       0.005   0.003   0.013   0.017          0.107   0.049   0.067   0.102
Table 2. Kappa values between labeling results and labeling standards of bird images.

              Kappa Values Based on Whole Image                Kappa Values Based on Contour Ring
              Confusion Matrix                                 Confusion Matrix
Image No.     KA     KB     KAB    KA,B   KAC    KBC           KA     KB     KAB    KA,B   KAC    KBC
No. 21        0.984  0.984  0.987  0.970  0.985  0.985         0.867  0.869  0.888  0.675  0.875  0.874
No. 22        0.975  0.971  0.975  0.947  0.968  0.970         0.826  0.802  0.826  0.650  0.791  0.804
No. 23        0.975  0.975  0.975  0.951  0.979  0.979         0.787  0.794  0.792  0.605  0.823  0.832
No. 24        0.982  0.976  0.981  0.961  0.977  0.979         0.863  0.820  0.853  0.709  0.820  0.843
No. 25        0.970  0.979  0.977  0.953  0.976  0.981         0.786  0.851  0.835  0.675  0.836  0.872
No. 26        0.983  0.980  0.983  0.964  0.982  0.984         0.857  0.848  0.872  0.726  0.861  0.875
No. 27        0.949  0.948  0.948  0.898  0.951  0.955         0.842  0.841  0.843  0.691  0.858  0.868
No. 28        0.979  0.977  0.979  0.957  0.978  0.980         0.848  0.827  0.846  0.693  0.850  0.857
No. 29        0.947  0.966  0.962  0.917  0.952  0.949         0.753  0.842  0.819  0.626  0.789  0.775
No. 30        0.980  0.979  0.980  0.961  0.980  0.981         0.842  0.829  0.838  0.692  0.846  0.851
Mean 21–30    0.972  0.974  0.975  0.948  0.973  0.974         0.828  0.832  0.841  0.674  0.835  0.845
No. 31        0.973  0.975  0.970  0.946  0.969  0.971         0.883  0.893  0.864  0.762  0.857  0.874
No. 32        0.979  0.800  0.978  0.958  0.984  0.981         0.809  0.807  0.818  0.639  0.855  0.828
No. 33        0.971  0.975  0.976  0.950  0.970  0.973         0.808  0.831  0.839  0.668  0.814  0.829
No. 34        0.974  0.978  0.978  0.953  0.977  0.981         0.814  0.839  0.838  0.667  0.838  0.865
No. 35        0.987  0.987  0.990  0.976  0.989  0.991         0.773  0.750  0.799  0.587  0.815  0.847
No. 36        0.971  0.971  0.975  0.948  0.974  0.976         0.766  0.770  0.807  0.599  0.805  0.823
No. 37        0.980  0.977  0.979  0.959  0.983  0.980         0.815  0.786  0.799  0.626  0.846  0.833
No. 38        0.983  0.984  0.973  0.958  0.983  0.974         0.824  0.810  0.807  0.648  0.826  0.787
No. 39        0.969  0.972  0.972  0.946  0.974  0.970         0.824  0.836  0.839  0.694  0.855  0.832
No. 40        0.975  0.977  0.973  0.949  0.974  0.977         0.815  0.820  0.805  0.634  0.822  0.844
Mean 31–40    0.976  0.960  0.976  0.954  0.978  0.977         0.813  0.814  0.822  0.652  0.833  0.836
Mean 21–40    0.974  0.967  0.976  0.951  0.975  0.976         0.820  0.823  0.831  0.663  0.834  0.841
p-values      0.702  0.473  0.840  0.701  0.586  0.736         0.636  0.390  0.352  0.488  0.941  0.722
Table 3. Kappa values for different contour rings produced from different edge detection results.

            Kappa Value for Confusion Matrix       Kappa Value for Confusion Matrix
            of Contour Ring I                      of Contour Ring II
Image No.   KA      KB      KAB     KA,B           KA      KB      KAB     KA,B
No. 41      0.912   0.941   0.914   0.859          0.912   0.942   0.914   0.859
No. 42      0.922   0.942   0.934   0.863          0.922   0.942   0.934   0.863
No. 43      0.921   0.927   0.894   0.823          0.925   0.932   0.898   0.828
No. 44      0.894   0.942   0.918   0.838          0.894   0.945   0.920   0.839
No. 45      0.894   0.948   0.915   0.844          0.894   0.948   0.915   0.844
No. 46      0.923   0.950   0.866   0.791          0.922   0.950   0.866   0.790
No. 47      0.920   0.933   0.917   0.850          0.920   0.933   0.917   0.850
No. 48      0.918   0.944   0.921   0.855          0.918   0.944   0.921   0.855
No. 49      0.912   0.942   0.930   0.855          0.912   0.942   0.929   0.855
No. 50      0.905   0.921   0.917   0.821          0.905   0.921   0.917   0.821
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
