Article

Multimodal Feature-Driven Deep Learning for the Prediction of Duck Body Dimensions and Weight

by Wenbo Xiao 1,2,3,†, Qiannan Han 1,†, Gang Shu 4, Guiping Liang 1, Hongyan Zhang 1, Song Wang 1, Zhihao Xu 1, Weican Wan 2, Chuang Li 2, Guitao Jiang 2,* and Yi Xiao 1,*

1 College of Information and Technology, Hunan Agricultural University, Changsha 410128, China
2 Institute of Animal Sciences and Veterinary Medicine, Hunan Academy of Agricultural Sciences, Changsha 410131, China
3 School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
4 College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Agriculture 2025, 15(10), 1021; https://doi.org/10.3390/agriculture15101021
Submission received: 29 March 2025 / Revised: 26 April 2025 / Accepted: 1 May 2025 / Published: 8 May 2025

Abstract:
Accurate body dimension and weight measurements are critical for optimizing poultry management, health assessment, and economic efficiency. This study introduces an innovative deep learning-based model leveraging multimodal data—2D RGB images from different views, depth images, and 3D point clouds—for the non-invasive estimation of duck body dimensions and weight. A dataset of 1023 Linwu ducks, comprising over 5000 samples with diverse postures and conditions, was collected to support model training. The proposed method innovatively employs PointNet++ to extract key feature points from point clouds, extracts and computes corresponding 3D geometric features, and fuses them with multi-view convolutional 2D features. A Transformer encoder is then utilized to capture long-range dependencies and refine feature interactions, thereby enhancing prediction robustness. The model achieved a mean absolute percentage error (MAPE) of 5.73% and an R2 of 0.953 across seven morphometric parameters describing body dimensions, and an MAPE of 10.49% with an R2 of 0.952 for body weight, indicating robust and consistent predictive performance across both structural and mass-related phenotypes. Unlike conventional manual measurements, the proposed model enables high-precision estimation while eliminating the necessity for physical handling, thereby reducing animal stress and broadening its application scope. This study marks the first application of deep learning techniques to poultry body dimension and weight estimation, providing a valuable reference for the intelligent and precise management of the livestock industry with far-reaching practical significance.

1. Introduction

Duck body measurements, including indices such as body diagonal length, keel length, chest depth, and chest width, along with weight, play a vital role in real-time monitoring of growth and development [1]. These metrics are essential for understanding individual differences, identifying outliers (e.g., excessively large or small ducks), and diagnosing potential nutritional issues. By facilitating timely human intervention in feeding strategies, such measurements contribute to improving breeding efficiency and provide data-driven support for intelligent and precision farming. In addition to these general applications, studies have shown that body measurements are directly related to growth performance, carcass quality, and selection criteria in duck breeding. For instance, the Pekin duck’s body measurements, including keel length and body diagonal, are correlated with its growth rate and carcass composition, suggesting that larger body dimensions are associated with higher meat yields and better carcass quality. This connection further emphasizes the importance of accurate body measurements in evaluating the potential of duck breeds for meat production and optimizing breeding strategies [2].
Traditionally, duck body measurements rely on manual methods involving tape measures and calipers, which require additional human assistance to hold ducks stationary. This process is not only labor-intensive and time-consuming but also induces stress in ducks, potentially affecting their subsequent growth and development. As computer technology advances and artificial intelligence matures, contactless measurement methods leveraging computer vision and AI are increasingly expected to replace traditional techniques. These innovations offer promising opportunities for reducing labor demands, minimizing measurement errors, and supporting stress-free data collection, paving the way for precision farming applications.
Researchers around the world have extensively studied the application of computer technology in livestock farming, particularly in poultry farming [3]. With the development of deep learning, Convolutional Neural Networks (CNNs) and Transformer models have become prominent tools for poultry body dimension and weight estimation. CNNs have shown significant advantages in extracting image features. For example, Zhuang and Zhang (2019) improved the SSD model for real-time health monitoring of broilers, achieving a 99.7% mean average precision [4]. In 2023, Duan et al. employed a multi-object tracking model for monitoring group-raised duck activeness [5], demonstrating the utility of CNNs in poultry behavior analysis. Despite their success, CNNs have limitations, particularly in situations where complex spatial relationships across different features need to be captured, making it difficult to handle diverse or distant visual information [6].
The Transformer model [7], introduced in 2017, addresses some of these limitations by using attention mechanisms to capture long-range dependencies across input sequences. The Vision Transformer (ViT) [8] has demonstrated strong global perception capabilities, making it well suited to tasks that require understanding the relationships between distant features. However, Transformer-based models tend to underperform when training data are limited, as they are less sample-efficient than CNNs and their accuracy degrades on small datasets [9]. To overcome this, combining CNNs with Transformers has proven beneficial for capturing both local features and global dependencies. For instance, He et al. (2023) combined CNN and Transformer components in the Residual Transformer Fine-Grained (ResTFG) model for fine-grained classification of chicken Eimeria species, achieving superior precision and inference speed compared to other models [10].
In recent years, significant progress has been made in the prediction of body dimensions for large livestock using computer vision and sensor technologies [11]. For example, Zhang et al. (2018) applied image processing techniques combined with SLIC superpixels and fuzzy C-means clustering to perform foreground segmentation, centerline extraction, and automatic measurement point identification, enabling the estimation of body dimensions for sheep [12]. Similarly, Du et al. (2022) employed a 2D-3D fusion approach, utilizing deep learning models to detect key points in RGB images, which were then projected onto the surface of point clouds [13]. By integrating interpolation and pose normalization techniques, their method achieved automatic measurement of multiple body dimension parameters for cattle and pigs, with MAPEs reduced to below 10%. Moreover, in 2023, Hao et al. improved the PointNet++ point cloud segmentation model by subdividing point clouds into local regions such as the head, ears, and torso [23]. This refinement enhanced the accuracy of key point localization for measurements, and with additional geometric processing algorithms, the relative errors in multiple body dimension parameters for pigs were significantly reduced.
Despite these advancements, research on body dimension prediction remains confined to large livestock. Due to the greater cost sensitivity in poultry farming, as well as the larger range of motion in their body shape, it is more challenging to estimate body dimensions using shape and geometric analysis. Additionally, for waterfowl such as ducks and geese, their regular need to swim makes it difficult to collect visual data in a fixed, controlled environment as is accomplished with confined livestock. As a result, studies addressing the prediction of weight and body dimensions for poultry are virtually non-existent. To bridge this gap, this study proposes a method for visual data acquisition of ducks and a neural network model based entirely on deep learning techniques, utilizing the extraction and fusion of 3D spatial features and 2D image features to predict the weight and body dimensions of ducks. This is the first computer vision-based model aimed at weight and body dimension prediction for poultry. The primary objectives of this study are as follows:
  • To propose a comprehensive hardware–software integrated multidimensional and multi-view visual data acquisition scheme for ducks, which is utilized to collect a dataset of duck visual data along with their corresponding body dimensions and weight.
  • To propose a method combining PointNet++ to identify key points in the point cloud and compute the 3D geometric features of the duck.
  • To propose a deep learning model combining 2D convolutional features and 3D geometric features to predict the body dimensions and weight of the duck.
  • To evaluate the performance and effectiveness of the model and discuss potential avenues for future improvements.

2. Data Processing

2.1. Dataset Description

A suitable dataset is crucial for deep learning applications. For the prediction of duck body dimensions, it is essential to have visual information of the ducks, along with corresponding ground truth measurements as annotations. However, to date, there is no publicly available dataset in this field that meets these requirements. Therefore, we collected the dataset by visiting Hunan Linwu Shunhua Duck Industrial Development Co., Ltd., located in Linwu County, Chenzhou, Hunan Province, China. We obtained side-view depth images, RGB images, and top-view RGB images of 1023 ducks, and measured their corresponding morphometric parameters as shown in Table 1. To enhance the generalization capability of the model, multiple sets of visual information were typically collected for each duck in different postures and states. In our research, 5238 sets of visual information of 1023 Linwu ducks were collected.

2.2. Collection Method

During the visual data acquisition process, we utilized the Intel RealSense D415 depth camera, which is capable of simultaneously capturing RGB and depth images at distances beyond 0.4 m. Additionally, we employed a Logitech C270 camera as an RGB camera to capture images from another angle. Several high-powered supplementary lights were used to enhance brightness, particularly in poorly lit environments.
In the actual measurement of the duck’s morphometric parameters, the weight was measured using an electronic weighing scale, with the measurement influenced by the duck’s feeding and hydration state, typically resulting in an error of ±75 g. The body diagonal length, neck length, semi-diving length, and keel length were measured using a measuring tape, with values rounded to the nearest 0.5 cm, leading to a measurement error of ±0.25 cm. The chest width, chest depth, and tibia length were measured using an electronic caliper, with the measurement error within ±0.15 cm. These measurements were manually recorded as annotations.
Ducks, particularly the older Linwu ducks, are inherently lively and exhibit strong reactions to human contact, making it extremely challenging to capture visual information. Therefore, during photography, a relatively enclosed space is required to keep the ducks calm. We designed a relatively enclosed shooting box, with dimensions of 0.6 m in length and width, and a height of 0.8 m, as shown in Figure 1, and arranged the imaging equipment on the top and sides of the box. Given that the depth camera has specific distance requirements for capturing objects, and that too close a distance prevents the camera from capturing the full image of the duck, it is necessary to maintain a certain distance between the camera and the subject. However, an excessive shooting distance would create additional space, giving the duck more room to move, and making it easier for the duck to escape if the side with the camera is left open. To address this, we designed the shooting side as a 45-degree inclined plane and positioned the RealSense depth camera above the slope. The inclined plane measures 0.6 m in both length and width, and the RGB camera is positioned 1 m above the ground. This design not only creates an enclosed environment for the duck but also provides adequate space for capturing images through the extended distance created by the inclined plane.
The entire data collection process adhered to ethical standards, ensuring minimal stress and discomfort for the ducks. All procedures involving animal handling were carried out in compliance with institutional and national guidelines for the ethical use of animals in research. The ducks were treated with care and respect throughout the measurement and imaging sessions, ensuring their welfare was prioritized.
As illustrated in Figure 2, we developed a software system for capturing visual information. This software is capable of simultaneously capturing data from multiple camera angles and can automatically align the RealSense depth images with the RGB images during shooting. Additionally, it generates and saves point cloud data. The software also features automatic numbering, export, and rollback functions for operational convenience.

2.3. Feature Extraction

Since duck body dimensions are precise numerical values, directly regressing them from raw images with a neural network may be suboptimal, particularly when the dataset is small. The specific differences are compared and discussed in detail in Section 4. Therefore, it is necessary to incorporate additional feature extraction steps to obtain more direct information.
A point cloud is a data structure that represents the shape and spatial information of an object by capturing thousands or millions of three-dimensional coordinates of points on the object’s surface. It is widely used in 3D modeling, computer vision, and environmental perception [14]. The point cloud images collected during our data acquisition process can provide more intuitive characteristic information about the ducks. In [15], the researchers developed a 3D point cloud model to directly predict key points associated with cattle body dimensions, which were subsequently used to calculate the body dimension data of cattle. However, for poultry, thick feathers and greater morphological flexibility make it nearly impossible to calculate body dimensions directly from the key points of ducks. We instead aim to identify key points in the point cloud that are related to the body dimensions of ducks and use the geometric relationships between these points as features to provide additional information for subsequent body dimension predictions.
First, we applied a series of steps to denoise and filter the point cloud data. Initially, we used a statistical filtering method to identify and remove noise points by calculating the average distance of each point to its 20 nearest neighbors and comparing it to the global average distance. We set a standard deviation multiplier threshold of 2.0, meaning that any point whose average distance exceeds twice the global average distance is considered noise and subsequently removed. Following this denoising process, we performed a clustering analysis on the point cloud data and further filtered the clusters by retaining only those with a point count of 9000 or more. This approach not only effectively removed isolated and anomalous points but also ensured that the remaining point cloud clusters were sufficiently dense and representative, thereby enhancing the accuracy and reliability of subsequent processing steps.
In the denoising process, the average distance $\bar{d}_i$ between each point $p_i$ and its $k$ nearest neighbors is calculated:

$$\bar{d}_i = \frac{1}{k} \sum_{j=1}^{k} \left\| p_i - p_{i_j} \right\|$$

where $p_{i_j}$ denotes the $j$-th nearest neighbor of point $p_i$ and $k = 20$ is the number of nearest neighbors.
This average distance is then compared with the global average distance $\bar{d}_{\text{global}}$. A standard deviation multiplier threshold $\sigma_{\text{threshold}} = 2.0$ is set. If $\bar{d}_i$ satisfies the following condition, the point is considered noise and removed:

$$\bar{d}_i > \bar{d}_{\text{global}} + 2.0 \times \sigma_{\text{global}}$$

where $\sigma_{\text{global}}$ represents the standard deviation of the global distances.
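To make this filtering procedure concrete, the following is a minimal sketch of the statistical denoising and cluster-size filtering described above, using NumPy, SciPy's cKDTree, and scikit-learn's DBSCAN. The k = 20 neighborhood, the 2.0 standard-deviation multiplier, and the 9000-point cluster threshold come from the text; the clustering algorithm and its eps scale are our assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import DBSCAN

def statistical_denoise(points: np.ndarray, k: int = 20, std_mult: float = 2.0) -> np.ndarray:
    """Drop points whose mean k-NN distance exceeds the global mean + std_mult * std."""
    tree = cKDTree(points)
    # query k+1 neighbors because the nearest neighbor of each point is the point itself
    dists, _ = tree.query(points, k=k + 1)
    mean_knn = dists[:, 1:].mean(axis=1)                  # per-point mean distance to its k neighbors
    d_global, sigma_global = mean_knn.mean(), mean_knn.std()
    return points[mean_knn <= d_global + std_mult * sigma_global]

def keep_dense_clusters(points: np.ndarray, eps: float = 0.02, min_cluster: int = 9000) -> np.ndarray:
    """Retain only clusters with at least `min_cluster` points (eps is an assumed scale)."""
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(points)
    keep = [l for l in np.unique(labels) if l != -1 and (labels == l).sum() >= min_cluster]
    return points[np.isin(labels, keep)]

# cloud = np.loadtxt("duck.xyz")                          # hypothetical (N, 3) point cloud
# cloud = keep_dense_clusters(statistical_denoise(cloud))
```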
Subsequently, as shown in Figure 3, we selected seven relevant points as feature points.
  • Point A: Located at the foremost tip of the duck’s beak.
  • Point B: At the highest point of the duck’s head.
  • Point C: At the most prominent point where the duck’s neck curves towards the tail.
  • Point D: At the junction between the duck’s neck and chest.
  • Point E: Located at the very end of the duck’s tail.
  • Point F: At the top of the duck’s foot.
  • Point G: At the bottom of the duck’s foot.
Although not all of these points are necessarily meaningful from the perspective of avian science, they are distinctive and easily learnable features from a computational viewpoint. We developed a simple software tool for annotating feature points in point clouds and manually annotated the above key points in 150 point cloud images.
Given that each processed point cloud image still contains over 200,000 points, it is essential to reduce the complexity to a manageable scale for deep learning models. We employed Farthest Point Sampling (FPS) [16] to uniformly downsample all point clouds to a consistent size of 8192 points per sample. FPS ensures spatial uniformity by iteratively selecting points that maximize the minimum distance to previously selected points, thus preserving critical geometric details. Formally, the FPS algorithm selects a subset of points S from the original set of points P using the following iterative criterion:
$$p_{\text{new}} = \arg\max_{p \in P \setminus S} \, \min_{q \in S} \, \| p - q \|$$
This method maintains an optimal spatial distribution, critical for capturing anatomical features accurately.
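The FPS criterion above can be implemented directly; the sketch below is a plain NumPy version for illustration (PointNet++ codebases typically ship an optimized CUDA implementation).

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int = 8192) -> np.ndarray:
    """Iteratively select the point that maximizes the minimum distance to the chosen subset."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)          # distance of every point to the current subset S
    selected[0] = np.random.randint(n)     # arbitrary seed point
    for i in range(1, n_samples):
        last = points[selected[i - 1]]
        # update the minimum distances with the most recently added point
        min_dist = np.minimum(min_dist, np.linalg.norm(points - last, axis=1))
        selected[i] = int(np.argmax(min_dist))
    return points[selected]

# sampled = farthest_point_sampling(cloud, 8192)   # cloud: (N, 3) array with N >= 8192
```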
As shown in Figure 4b, the processed 8192-point clouds were then fed into a modified PointNet++ model [16], a hierarchical neural network recognized for effectively capturing local-to-global point cloud features through Set Abstraction (SA) layers. The original PointNet++ architecture, intended for classification and segmentation tasks, was adjusted to perform precise keypoint regression by altering the final fully connected layers.
Specifically, we retained the hierarchical feature extraction approach of PointNet++, which consists of sequential SA layers that progressively aggregate local and global spatial information. To perform regression, the classification and segmentation heads of PointNet++ were replaced with fully connected regression layers. Mathematically, the global feature vector $F_{\text{global}}$ (extracted from the last SA layer) is passed through two sequential non-linear mappings followed by a linear mapping, which directly predicts the three-dimensional coordinates of the seven keypoints. Formally, this transformation is defined as follows:
$$Y = W_3 \, \mathrm{ReLU}\!\left( W_2 \, \mathrm{ReLU}\!\left( W_1 F_{\text{global}} + b_1 \right) + b_2 \right) + b_3$$
where $W_i$ and $b_i$ are the weight matrices and bias vectors of the fully connected layers. The resulting output dimension is $7 \times 3 = 21$, corresponding exactly to the coordinates of the seven predefined feature points. After training for 40 epochs, our modified PointNet++ achieved robust performance, evidenced by a mean squared error (MSE) of only 0.0009 on the test dataset.
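As a rough PyTorch sketch, the regression head that replaces the PointNet++ classification/segmentation head could look like the module below. The Set Abstraction backbone is assumed to come from an existing PointNet++ implementation; the 1024-dimensional global feature and the 512/256 hidden widths are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class KeypointRegressionHead(nn.Module):
    """Maps the global PointNet++ feature vector to 7 keypoints x 3 coordinates."""
    def __init__(self, in_dim: int = 1024, n_keypoints: int = 7):
        super().__init__()
        self.n_keypoints = n_keypoints
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),   # W1, b1
            nn.Linear(512, 256), nn.ReLU(),      # W2, b2
            nn.Linear(256, n_keypoints * 3),     # W3, b3 -> 21 outputs
        )

    def forward(self, f_global: torch.Tensor) -> torch.Tensor:
        # f_global: (B, in_dim) global feature from the last Set Abstraction layer
        out = self.mlp(f_global)
        return out.view(-1, self.n_keypoints, 3)   # (B, 7, 3) predicted keypoint coordinates

# head = KeypointRegressionHead()
# keypoints = head(torch.randn(4, 1024))            # -> shape (4, 7, 3)
```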
After identifying the seven feature points for each point cloud sample, we extracted ten key geometric features from the seven points to describe their spatial relationships. Specifically, we calculated the following:
  • Distances between points:
    • Distance between points A and B.
    • Distance between points B and C.
    • Distance between points C and D.
    • Distance between points D and E.
    • Distance between points E and F.
    • Distance between points F and G.
  • Angles formed by points:
    • Angle between points A, B, and C.
    • Angle between points B, C, and D.
    • Angle between points C, D, and E.
    • Angle between points D, E, and F.
These feature values comprehensively reflect the relative positions and arrangements of the points, providing critical geometric information for our analysis.
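A small NumPy sketch of this step is given below; we interpret "the angle between points A, B, and C" as the angle at the middle point (here B), which is our assumption, and the helper names are ours.

```python
import numpy as np

def angle_at(p_prev: np.ndarray, p_mid: np.ndarray, p_next: np.ndarray) -> float:
    """Angle (in radians) at p_mid formed by the segments p_mid->p_prev and p_mid->p_next."""
    v1, v2 = p_prev - p_mid, p_next - p_mid
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def geometric_features(kp: np.ndarray) -> np.ndarray:
    """kp: (7, 3) array of keypoints A..G; returns the 10-dimensional geometric feature vector."""
    A, B, C, D, E, F, G = kp
    dists = [np.linalg.norm(a - b) for a, b in [(A, B), (B, C), (C, D), (D, E), (E, F), (F, G)]]
    angles = [angle_at(A, B, C), angle_at(B, C, D), angle_at(C, D, E), angle_at(D, E, F)]
    return np.array(dists + angles, dtype=np.float32)   # shape (10,)
```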

2.4. Data Preprocessing

In the dataset, many visual data were of low quality due to factors such as lighting conditions, accidental occlusions, or duck movements during capture, rendering them unsuitable for training. Consequently, we filtered out a portion of the images that did not meet the required standards. Ultimately, a total of 4822 sets of image data were used for training.
In our study, during the preprocessing of depth data, we retained only the depth information within the range of 40 cm to 1200 cm, as the subjects of our images—ducks—typically remained within this distance. Depth values outside this range were considered noise and thus filtered out. Subsequently, the retained depth data were normalized to a standard grayscale range suitable for image representation. This preprocessing step effectively transformed the multidimensional depth data into a visual format that is easier to analyze and interpret.
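A minimal sketch of this clipping-and-normalization step is shown below, assuming the depth frames are expressed in centimetres and that out-of-range pixels are mapped to zero (background); the exact unit handling and output encoding in the authors' pipeline are not specified.

```python
import numpy as np

def depth_to_grayscale(depth_cm: np.ndarray, d_min: float = 40.0, d_max: float = 1200.0) -> np.ndarray:
    """Clip depth values to [d_min, d_max] and rescale them to an 8-bit grayscale image."""
    valid = (depth_cm >= d_min) & (depth_cm <= d_max)
    gray = np.zeros(depth_cm.shape, dtype=np.uint8)      # out-of-range pixels treated as background
    gray[valid] = ((depth_cm[valid] - d_min) / (d_max - d_min) * 255).astype(np.uint8)
    return gray
```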
The background of the images contained unnecessary content and thus needed to be removed. We employed U-Net [17], a CNN architecture designed primarily for biomedical image segmentation, to remove the background from images captured from the side and top views; its U-shaped structure enables precise localization and classification by combining high-resolution features from the contracting path with upsampled outputs from the expanding path. In our experiments, excellent segmentation results were achieved by annotating only 50 sets of images, as shown in Figure 5. The segmentation performance was evaluated with a Dice coefficient of 0.92 and an average surface distance (ASD) of 0.75 mm, demonstrating the model’s high accuracy in background removal. Since the depth images were already aligned with the side-view RGB images during acquisition, the depth data can be directly processed for background removal using the corresponding segmentation of the RGB images.
For the duck’s body weight and body dimension data, we applied Min–Max normalization to each column, ensuring more stable and reliable training in the subsequent stages.

3. Method

As shown in Figure 4, our proposed model consists of three main components: a Triple ResNet50 embedding module for feature extraction from images, a Transformer encoder module for capturing complex relationships among features, and a regression layer for predicting the duck body dimensions and weight.
To extract rich visual features from the images, we utilize three independent ResNet50 networks [18], each pre-trained on ImageNet. Each ResNet50 network processes one of the three input images: the top-view RGB image $I_{\text{top}}$, the side-view RGB image $I_{\text{side}}$, and the side-view depth grayscale image $I_{\text{depth}}$. The final classification layers of ResNet50 are removed, retaining the convolutional layers up to the penultimate layer to serve as feature extractors.
For each image $I_i$, the corresponding ResNet50 network outputs a feature map $F_i \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ are the spatial dimensions. These feature maps are then reshaped into sequences of feature vectors suitable for the Transformer encoder. Specifically, each feature map $F_i$ is flattened along the spatial dimensions and transposed to form a sequence $X_i \in \mathbb{R}^{N \times D}$, where $N = H \times W$ is the sequence length and $D = C$ is the feature dimension:

$$X_i = \mathrm{reshape}(F_i) \in \mathbb{R}^{N \times D}$$
After obtaining the feature sequences $X_i$ from the three ResNet50 models, we concatenate them along the sequence dimension to form a combined visual feature sequence:

$$X_{\text{img}} = \left[ X_{\text{top}}, X_{\text{side}}, X_{\text{depth}} \right] \in \mathbb{R}^{3N \times D}$$
In addition to the image-based features, we incorporate ten geometric features $G \in \mathbb{R}^{10}$ extracted from the point cloud data. These features capture specific spatial relationships between key points on the duck’s body that are not directly represented in the image data. To integrate these geometric features with the visual features, we expand $G$ along the sequence dimension to match the length of $X_{\text{img}}$:

$$G' = \mathrm{repeat}(G, 3N) \in \mathbb{R}^{3N \times 10}$$

We then concatenate $G'$ with $X_{\text{img}}$ along the feature dimension to form the final input sequence $X$:

$$X = \left[ X_{\text{img}}, G' \right] \in \mathbb{R}^{3N \times (D + 10)}$$
This combined feature sequence X contains both visual and geometric information, enabling the model to capture comprehensive representations of the duck’s body from different perspectives.
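The assembly of the fused sequence $X$ can be sketched in PyTorch as follows. The tensor shapes follow the equations above, but the truncation point of ResNet50, the use of torchvision pretrained weights, and the variable names are our assumptions; the single-channel depth grayscale image is also assumed to be replicated to three channels before being passed to its backbone.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def make_backbone() -> nn.Module:
    # keep ResNet50 up to its last convolutional stage (drop avgpool and fc)
    m = resnet50(weights="IMAGENET1K_V1")
    return nn.Sequential(*list(m.children())[:-2])         # output: (B, 2048, H, W)

def to_sequence(feat: torch.Tensor) -> torch.Tensor:
    # (B, C, H, W) -> (B, N, D) with N = H * W and D = C
    return feat.flatten(2).transpose(1, 2)

def build_fused_sequence(img_top, img_side, img_depth, g, backbones):
    # one backbone per view; g: (B, 10) geometric feature vector
    x_top, x_side, x_depth = (to_sequence(bb(img)) for bb, img in
                              zip(backbones, (img_top, img_side, img_depth)))
    x_img = torch.cat([x_top, x_side, x_depth], dim=1)      # (B, 3N, D)
    g_rep = g.unsqueeze(1).expand(-1, x_img.size(1), -1)    # repeat G along the sequence: (B, 3N, 10)
    return torch.cat([x_img, g_rep], dim=-1)                # X: (B, 3N, D + 10)
```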
The integrated feature sequence X is input into a Transformer encoder module to capture complex spatial relationships and dependencies among the features. The Transformer encoder consists of multiple layers, each comprising a multi-head self-attention mechanism and a position-wise feed-forward network, as described in [7]. The self-attention mechanism allows the model to focus on different parts of the sequence to effectively capture global dependencies. Residual connections and layer normalization are applied around the attention and feed-forward sublayers to facilitate training stability.
The Transformer encoder outputs a refined sequence of feature representations $Z \in \mathbb{R}^{3N \times (D + 10)}$ that captures the complex interactions among the visual and geometric features.
To aggregate the sequence of feature representations into a fixed-length vector suitable for regression, we apply average pooling over the sequence dimension:
$$z = \frac{1}{3N} \sum_{i=1}^{3N} Z_i \in \mathbb{R}^{D + 10}$$
Finally, a fully connected layer maps the pooled feature vector to the target outputs, namely the seven body dimension measurements and the body weight:
$$\hat{y} = z W + b \in \mathbb{R}^{8}$$

where $W \in \mathbb{R}^{(D + 10) \times 8}$ and $b \in \mathbb{R}^{8}$ are the weights and biases of the regression layer.
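A hedged sketch of the encoder, pooling, and regression stages is given below; the number of encoder layers and attention heads are illustrative assumptions, not the settings reported by the authors.

```python
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    """Transformer encoder over the fused sequence, mean pooling, and an 8-way regression head."""
    def __init__(self, feat_dim: int = 2048 + 10, n_layers: int = 2, n_heads: int = 6, n_targets: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, n_targets)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)        # (B, 3N, D + 10), refined by multi-head self-attention
        pooled = z.mean(dim=1)     # average pooling over the sequence dimension
        return self.head(pooled)   # (B, 8) normalized body dimensions and weight
```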
By integrating CNN and Transformer architectures, our model effectively captures both local visual features and global geometric relationships, enabling accurate estimation of duck body dimensions from multi-modal data sources.

4. Results and Discussions

4.1. Results

We utilized PyTorch 2.1.0 with CUDA 12.1 in an environment equipped with two Nvidia Tesla T4 GPUs for model training. The dataset was split into training, validation, and testing sets in a ratio of 8:1:1. After approximately 20 epochs of training, the model gradually converged.
The training process was carried out with the hyperparameter settings listed in Table 2.
These hyperparameters were selected through a combination of grid search and manual fine-tuning, with the goal of optimizing both model convergence speed and predictive performance. Notably, the StepLR scheduler was employed to reduce the learning rate by a factor of 0.1 every 10 epochs, helping the model to converge more effectively in later stages of training.
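For reference, a minimal training-loop sketch using the Table 2 settings (Adam, learning rate 1 × 10⁻⁴, weight decay 1 × 10⁻⁴, batch size 32, 50 epochs, StepLR decaying by 0.1 every 10 epochs, MSE loss) is shown below; the dataset object and the model's input signature are placeholders.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_set, epochs: int = 50, device: str = "cuda"):
    loader = DataLoader(train_set, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    criterion = nn.MSELoss()
    model.to(device).train()
    for _ in range(epochs):
        for img_top, img_side, img_depth, geom, targets in loader:
            preds = model(img_top.to(device), img_side.to(device),
                          img_depth.to(device), geom.to(device))
            loss = criterion(preds, targets.to(device))     # MSE on the normalized targets
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()        # decay the learning rate by 0.1 every 10 epochs
```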
The best-performing model was selected for weight and body dimension prediction, achieving the metrics presented in Table 3, with an overall MAPE of 5.73% and R2 of 0.953 for body dimension prediction, and an R2 of 0.952 with a MAPE of 10.53% for weight prediction. Figure 6 presents a comparison between the actual and predicted values of 100 samples from the test set, sorted in ascending order of the actual values. The parameters include weight, body diagonal length, neck length, semi-diving length, keel length, chest width, chest depth, and tibia length.
Experimental results demonstrate that our model achieved excellent performance in predicting both the weight and various body dimension metrics of ducks, with results highly comparable to manual measurements. This is especially notable considering potential errors introduced during dataset collection, such as weight variations due to the ducks’ fed or fasting state and the condition of their feathers (e.g., wet or dry), as well as errors arising from manual measurements of body dimensions.
In our experiments, we systematically evaluated several backbones used for 2D image feature extraction, including VGG16, VGG19 [19], ResNet34, ResNet50, ResNet101 [18] and Xception [20], as well as Transformer-based models (ViT-B/16, ViT-L/16 [8] and Swin-T [21]). We also examined model performance variations when incorporating or excluding 3D key geometric feature extraction (GFE), a Transformer encoder (TE), and a 2D depth grey image (DGI), as summarized in Table 4. In this comparison, when the Transformer encoders were excluded, the visual features were concatenated across multiple views, followed by global pooling, and directly fed into the fully connected prediction head.
The experimental results, as shown in Table 4, demonstrate that when ResNet50 was used as the backbone, the model achieved the best results for body dimensions prediction with an R2 of 0.953, MAPE of 5.73%, RMSE of 0.924, and MAE of 0.684. For weight prediction, although Xception slightly outperformed ResNet50 as a 2D image feature extraction backbone in three of the metrics, considering the overall prediction performance, we chose ResNet50 as the backbone. Transformer-based models demonstrated relatively lower performance compared to CNN-based backbones, indicating that Transformer-based backbones alone are less effective for capturing essential local spatial features with the current dataset size and task complexity. Additionally, the results validate the effectiveness of the components we proposed. Significant performance degradation was observed when 3D key geometric feature extraction and the Transformer encoder were individually excluded.

4.2. Discussion

The predictive results of this study are primarily derived from deep learning-based training and inference, with minimal reliance on explicit prior knowledge. Compared to existing studies on the prediction of weight and body dimensions of large livestock [22], our approach emphasizes a data-driven methodology, representing a significant innovation. While previous methods often integrated explicit prior knowledge into neural networks or machine learning algorithms for predicting livestock weight and body dimensions, our proposed model focuses on automatically learning relevant features from multimodal data, making it both novel and effective.
We have identified two critical factors contributing to the success of deep learning methods in predicting poultry weight and body dimensions in this study. First, as observed in [23] for the prediction of pig body dimensions, non-standard postures of animals can influence training effectiveness and prediction accuracy. This is particularly pronounced in poultry due to their greater range of motion, more active nature, and difficulty maintaining standard postures under bright lights or stress in shooting environments. To address this, we collected multiple sets of visual data from different postures of the same duck, as illustrated in Figure 7. This approach allowed the deep learning model to learn the relationships between various postures of the same individual during training. Experimental results confirmed the effectiveness of this method.
Second, the reduced reliance on explicit prior knowledge necessitates larger datasets to achieve optimal model performance. Initially, the dataset for this study was relatively small, leading to suboptimal prediction outcomes. Through continuous data collection and augmentation, the prediction results improved significantly as the sample size increased. To illustrate this point, we compared the R² performance among ResNet50, ResNet34, and ViT-B/16 as 2D image feature extractors across varying dataset sizes, and present the results in Figure 8.
Despite these achievements, our study has several limitations. Due to experimental constraints, we were unable to collect data samples covering all age groups of ducks and instead relied on ducks from specific age groups. This limitation likely contributes to the observed gaps in the actual values in Figure 6. Future studies could address this by conducting longitudinal data collection to cover the full life cycle of ducks, resulting in a larger and more diverse dataset. Such an effort is expected to further enhance the model’s predictive accuracy and robustness.
Additionally, the key points selected for analysis in Section 2.3 were derived based on their potential relevance to phenotypic features that may exhibit linear or non-linear correlations with body dimensions, rather than being directly measurable anatomical landmarks. This approach aims to balance computational feasibility and predictive utility. Nevertheless, further research is needed to explore and validate more systematic and data-driven methods for identifying optimal feature points that can better capture the geometric characteristics associated with duck body dimensions.
To deepen our analysis, we compared the proposed method with scenarios where 3D key geometric feature extraction and 2D depth grey image were excluded. The results, as shown in Table 4, reveal a noticeable performance drop as spatial information was progressively removed. To better illustrate the impact of this reduction in spatial features, we categorized the duck data based on the shortest distance from the duck to the depth camera: less than 0.5 m, between 0.5 m and 0.7 m, and greater than 0.7 m. We selected 20 data points from each of these categories and performed predictions using the proposed method, ResNet50 + TE, and ResNet50 + TE - DGI, visualized in the scatter plots in Figure 9.
Figure 9a,c show the prediction results for ducks located less than 0.5 m and greater than 0.7 m from the camera. In these cases, we observed that as spatial information was gradually removed, the predicted values tended to deviate more from the actual values, with predictions for distances less than 0.5 m tending to be larger than the actual values, while predictions for distances greater than 0.7 m tended to be smaller than the actual values. The largest deviations occurred when both 3D geometric features and depth grey image data were removed. In contrast, Figure 9b displays the results for ducks located between 0.5 and 0.7 m from the camera. Here, although there was a noticeable increase in prediction error with the removal of spatial information, the distribution of prediction errors was more uniform, with smaller overall deviations compared to the other two categories. These findings underscore the critical role of spatial information in predicting body dimensions and weight.

5. Conclusions

This study is the first to propose a method for measuring poultry body dimensions based on visual sensors and computer vision technology. By leveraging the multimodal fusion of 3D geometric features and 2D image features, it enables non-invasive and contactless estimation of the weight and body dimensions of ducks. The estimations achieved in this study are primarily based on deep learning, which, with continuous advancements in the field of computer vision, increasingly demonstrates the feasibility of relying on neural networks to extract relevant features from visual data for predicting animal weight and body dimensions.
While these results are promising, it is important to acknowledge the limitations of the current study. The model was trained on data from a single poultry species, and further research is required to assess its applicability across a broader range of species. Future studies could focus on improving the model’s ability to generalize across different datasets, optimizing feature fusion strategies, and addressing challenges such as small dataset scenarios using data augmentation or transfer learning techniques. Expanding and refining the dataset to include more varied examples will likely enhance the model’s performance and robustness.
Although the model’s primary application was demonstrated for ducks, given the shared phenotypic characteristics among various poultry species, it holds potential for adaptation to other species such as chickens and geese. However, this would require the collection of corresponding visual datasets from these species. Further validation and refinement are essential to ensure the model’s effectiveness across different poultry species. Additionally, the data collection methodology could be further improved in the future through the implementation of sensor technologies that automatically capture visual data without manual operation, which would enhance the scalability and efficiency of the measurement process while minimizing human labor requirements.
Moreover, considering that the unit cost of poultry is significantly lower than that of large livestock, the poultry industry is particularly sensitive to cost considerations. Enhancing the affordability of this technology, along with the integration of computer vision techniques into the commercial operations of poultry farming, presents a valuable opportunity for advancing the industry’s efficiency. This area warrants further research to optimize cost-effectiveness and facilitate the widespread adoption of such technologies in commercial poultry farming.

Author Contributions

Conceptualization, G.J. and Y.X.; methodology, W.X. and Q.H.; software, W.X.; validation, Q.H., G.S., G.L. and H.Z.; formal analysis, W.X.; investigation, W.X.; resources, W.X. and Q.H.; data curation, Q.H.; writing—original draft preparation, W.X. and Q.H.; writing—review and editing, S.W., Z.X., W.W. and C.L.; visualization, W.X.; supervision, G.J., G.S. and Y.X.; project administration, G.J. and Y.X.; funding acquisition, G.J. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Scientific Research Key Project of the Education Department of Hunan Province, China [24A0176], the National Key Research and Development Program of China [2021YFD1300404 and 2022YFD1600902-4], the National Natural Science Foundation of China [62402170], and the National College Student Innovation Training Program of China [s202410537126].

Institutional Review Board Statement

The animal study protocol was approved by the Institutional Animal Care and Use Committee (IACUC) of the Hunan Institute of Animal & Veterinary Science (HIAVS) (protocol code HIAVS-IACUC-2025-02 and date of approval 10 March 2025), covering shelducks during the period from 20 March 2026 to 31 December 2029. In addition, ethical approval support related to animal data collection and experimental conduct was also obtained from the Animal Care and Use Committee of Hunan Agricultural University (protocol code 20190602 and date of approval 9 August 2019). All animal handling procedures were designed to be non-invasive and minimize stress. No surgical, pharmacological, or behavioral interventions were involved in this study.

Data Availability Statement

The data used in this study are not publicly available due to ongoing development and potential intellectual property considerations but can be obtained from the corresponding authors upon reasonable request.

Acknowledgments

The authors would like to express their sincere gratitude to Yi Liu, Jiamei Chen, Guiming Chen, Jingwei Zhao, and Hunan Linwu Shunhua Duck Industrial Development Co., Ltd. for their invaluable assistance and contributions to the dataset collection process.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Teguia, A.; Ngandjou, H.M.; Defang, H.; Tchoumboue, J. Study of the live body weight and body characteristics of the African Muscovy duck (Caraina moschata). Trop. Anim. Health Prod. 2008, 40, 5–10.
  2. Kokoszyński, D.; Wasilewski, R.; Saleh, M.; Piwczyński, D.; Arpášová, H.; Hrnčar, C.; Fik, M. Growth Performance, Body Measurements, Carcass and Some Internal Organs Characteristics of Pekin Ducks. Animals 2019, 9, 963.
  3. Abd Aziz, N.S.N.; Mohd Daud, S.; Dziyauddin, R.A.; Adam, M.Z.; Azizan, A. A Review on Computer Vision Technology for Monitoring Poultry Farm—Application, Hardware, and Software. IEEE Access 2021, 9, 12431–12445.
  4. Zhuang, X.; Zhang, T. Detection of sick broilers by digital image processing and deep learning. Biosyst. Eng. 2019, 179, 106–116.
  5. Duan, E.; Han, G.; Zhao, S.; Ma, Y.; Lv, Y.; Bai, Z. Regulation of Meat Duck Activeness through Photoperiod Based on Deep Learning. Animals 2023, 13, 3520.
  6. Wang, J.; Xu, G.; Yan, F.; Wang, J.; Wang, Z. Defect transformer: An efficient hybrid transformer architecture for surface defect detection. Measurement 2023, 211, 112614.
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30.
  8. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  9. Lin, X.; Yan, Q.; Wu, C.; Chen, Y. Judgment Model of Cock Reproductive Performance based on Vison Transformer. In Proceedings of the 2022 5th International Conference on Sensors, Signal and Image Processing, SSIP ’22, Nanjing, China, 28–30 October 2022; Association for Computing Machinery: New York, NY, USA, 2023; pp. 37–42.
  10. He, P.; Chen, Z.; He, Y.; Chen, J.; Hayat, K.; Pan, J.; Lin, H. A reliable and low-cost deep learning model integrating convolutional neural network and transformer structure for fine-grained classification of chicken Eimeria species. Poult. Sci. 2023, 102, 102459.
  11. Ma, W.; Sun, Y.; Qi, X.; Xue, X.; Chang, K.; Xu, Z.; Li, M.; Wang, R.; Meng, R.; Li, Q. Computer-Vision-Based Sensing Technologies for Livestock Body Dimension Measurement: A Survey. Sensors 2024, 24, 1504.
  12. Zhang, L.; Wu, P.; Wuyun, T.; Jiang, X.; Xuan, C.; Ma, Y. Algorithm of sheep body dimension measurement and its applications based on image analysis. Comput. Electron. Agric. 2018, 153, 33–45.
  13. Du, A.; Guo, H.; Lu, J.; Su, Y.; Ma, Q.; Ruchay, A.; Marinello, F.; Pezzuolo, A. Automatic livestock body measurement based on keypoint detection with multiple depth cameras. Comput. Electron. Agric. 2022, 198, 107059.
  14. Rusu, R.B.; Cousins, S. 3D is here: Point Cloud Library (PCL). In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1–4.
  15. Huang, L.; Guo, H.; Rao, Q.; Hou, Z.; Li, S.; Qiu, S.; Fan, X.; Wang, H. Body Dimension Measurements of Qinchuan Cattle with Transfer Learning from LiDAR Sensing. Sensors 2019, 19, 5046.
  16. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
  17. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
  19. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
  20. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
  21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030.
  22. Ma, W.; Qi, X.; Sun, Y.; Gao, R.; Ding, L.; Wang, R.; Peng, C.; Zhang, J.; Wu, J.; Xu, Z.; et al. Computer Vision-Based Measurement Techniques for Livestock Body Dimension and Weight: A Review. Agriculture 2024, 14, 306.
  23. Hao, H.; Jincheng, Y.; Ling, Y.; Gengyuan, C.; Sumin, Z.; Huan, Z. An improved PointNet++ point cloud segmentation model applied to automatic measurement method of pig body size. Comput. Electron. Agric. 2023, 205, 107560.
Figure 1. Collection devices and their layout.
Figure 2. Capture software.
Figure 3. Seven annotated feature points on the duck’s point cloud used for geometric feature extraction, including key anatomical landmarks such as the beak tip, head peak, neck curve, and tail tip.
Figure 4. Architecture of the proposed multimodal model for predicting duck body dimensions and weight. (a) Two-dimensional feature extraction: Three independent ResNet50 models extract visual features from top-view RGB, side-view RGB, and depth images, respectively. (b) Three-dimensional feature extraction: A modified PointNet++ model performs hierarchical feature extraction from the duck point cloud through three Set Abstraction layers. The final global features are then processed by fully connected layers to predict seven anatomical keypoints in 3D coordinates. (c) Multimodal feature fusion: Image features and 3D geometric features are integrated and refined through a Transformer encoder, enabling accurate prediction by capturing global feature dependencies across different data modalities.
Figure 5. Segmentation results of the duck’s side-view.
Figure 6. Paired scatter plots for eight parameters: weight (a), body diagonal length (b), neck length (c), semi-diving length (d), keel length (e), chest width (f), chest depth (g), and tibia length (h).
Figure 7. Three side-view RGB images (segmented) of the same duck in different poses.
Figure 8. Comparison of R² performance between ResNet50 and ResNet34 as 2D image feature extractors under varying dataset sizes.
Figure 9. Scatter plots of predicted versus actual weight for various proximity categories of ducks. (a) Less than 0.5 m. (b) Between 0.5 and 0.7 m. (c) Greater than 0.7 m.
Table 1. Duck morphometric parameters with corresponding units and definitions.

Parameter | Unit | Definition
Weight | g | The overall weight of the duck.
Body Diagonal Length | cm | The diagonal length from the tip of the beak to the tail.
Neck Length | cm | The length of the duck’s neck, from the base to the head.
Semi-Diving Length | cm | The depth to which the duck’s body is submerged when it enters the water while diving.
Keel Length | cm | The length of the duck’s keel bone, influencing chest development.
Chest Width | cm | The width of the duck’s chest, indicating chest development.
Chest Depth | cm | The vertical distance from the back to the abdomen, reflecting chest depth.
Tibia Length | cm | The length of the duck’s tibia, associated with its mobility.
Table 2. Hyperparameters used for model training.

Hyperparameter | Value
Learning Rate | 1 × 10⁻⁴
Batch Size | 32
Optimizer | Adam
Weight Decay | 1 × 10⁻⁴
Epochs | 50
Learning Rate Scheduler | StepLR
Loss Function | MSE
Table 3. Performance metrics for duck weight and body dimensions in the test set.

Morphometric Parameter | R² | MAPE (%) ↓ | RMSE ↓ | MAE ↓
Weight (g) | 0.952 | 10.49 | 135.0 | 96.63
Body Diagonal Length (cm) | 0.968 | 5.17 | 0.813 | 0.651
Neck Length (cm) | 0.927 | 5.89 | 1.120 | 0.804
Semi-Diving Length (cm) | 0.966 | 4.32 | 2.298 | 1.687
Keel Length (cm) | 0.973 | 6.77 | 0.773 | 0.577
Chest Width (cm) | 0.952 | 6.34 | 0.563 | 0.387
Chest Depth (cm) | 0.931 | 6.92 | 0.517 | 0.385
Tibia Length (cm) | 0.955 | 4.74 | 0.381 | 0.297
Overall (Body Dimensions) | 0.953 | 5.73 | 0.924 | 0.684
Table 4. Performance metrics for different model architectures. Left block: body dimensions average; right block: weight.

Model | R² | MAPE (%) ↓ | RMSE ↓ | MAE ↓ | R² | MAPE (%) ↓ | RMSE ↓ | MAE ↓
VGG16 + GFE + TE | 0.928 | 7.53 | 1.045 | 0.789 | 0.935 | 18.29 | 150.2 | 117.53
VGG19 + GFE + TE | 0.933 | 7.07 | 0.929 | 0.822 | 0.919 | 14.24 | 138.4 | 114.16
ViT-L/16 + GFE + TE | 0.843 | 11.64 | 1.630 | 1.277 | 0.755 | 40.62 | 291.7 | 245.73
ViT-B/16 + GFE + TE | 0.845 | 9.44 | 1.484 | 1.086 | 0.834 | 17.67 | 214.9 | 152.48
Swin-T + GFE + TE | 0.943 | 6.32 | 0.962 | 0.692 | 0.922 | 12.58 | 138.2 | 97.24
Xception + GFE + TE | 0.896 | 7.46 | 1.250 | 0.875 | 0.953 | 10.87 | 131.0 | 92.51
ResNet34 + GFE + TE | 0.882 | 10.07 | 1.413 | 1.113 | 0.932 | 16.42 | 154.2 | 123.32
ResNet101 + GFE + TE | 0.933 | 6.64 | 0.948 | 0.722 | 0.947 | 13.23 | 139.1 | 102.07
ResNet50 + TE | 0.903 | 7.13 | 1.121 | 0.822 | 0.903 | 13.28 | 143.8 | 96.88
ResNet50 + TE - DGI | 0.710 | 22.13 | 2.34 | 2.31 | 0.694 | 38.19 | 301.2 | 281.88
ResNet50 Only | 0.808 | 15.21 | 2.081 | 1.918 | 0.850 | 30.14 | 307.8 | 267.45
ResNet50 + GFE + TE | 0.953 | 5.73 | 0.924 | 0.684 | 0.952 | 10.53 | 135.0 | 96.65
