1. Introduction
Extrusion is a common method for producing texturized vegetable proteins (TVP) or plant-based meat. Through the combined effects of heat, pressure, shear, and moisture, plant-derived proteins are transformed into fibrous, meat-like structures. This technology is valued not only for its ability to replicate the texture of meat, but also for its potential to deliver environmentally sustainable, high-protein foods. Diverse protein sources, such as soy, pea, and hemp, have been used for TVP products, which underscores the adaptability of the extrusion process [
1,
2,
3].
Microstructural properties are key factors that influence various textural signatures of the product, such as mechanical strength, chewiness, and springiness [
4]. The degree of fibrosity (i.e., fibrousness) of TVP products is often evaluated subjectively from product images [
2,
5,
6,
7,
8].
The underlying goal of this research is to develop a model that can objectively assess the fibrosities of plant-based products from input images. Such an automated scheme, which provides numerical output scores, would help in reducing, if not eliminating, the need for subjective human inspection during the extrusion process. Hence, this research is a major step towards the development of better structures and textures, as well as towards process control automation.
Research articles on methods that treat microstructural features of TVP in a similarly objective manner have begun to appear. A review article on analytical approaches for assessing plant-based meat analogs, including microstructure analysis using image processing algorithms to obtain fiber index values, has been published in [
9]. An automated image analysis method (i.e., Fiberlyzer) to quantify fibrosities of plant-based meats was proposed in [
10]. Strong correlations between computed fiber scores and expert panel evaluations demonstrate the effectiveness of this approach, thereby illustrating how computer vision can be leveraged for objective assessment. A non-destructive, laser transmission method using computer vision to quantify the degrees of orientation in fibrous foods has been proposed in [
11]. This technique was shown to reliably capture structural alignment—a feature associated with mechanical texture and consumer acceptance, thereby offering a more objective alternative to traditional visual inspection. The relationship between structural and mechanical anisotropy in plant-based meat products has also been examined [
12]. This study, which draws on X-ray scattering and scanning electron microscopy, demonstrates how high protein content and controlled processing conditions promote fibrous alignment and mechanical strength. Microstructural anisotropy indices served as robust indicators of product quality in this research. More recently, an integrated framework to obtain TVP fibrosity scores from extrusion parameters was explored in [
13]. This approach, which involves machine learning and computer vision, can be adopted in realizing optimal process control in real time.
A deep neural network (DNN), such as that proposed in this research, is a trainable machine learning model that is roughly organized in the manner of the human cortex [
14]. It consists of several layers of array processing. The first layer (input layer) acquires the DNN’s input, which is passed onto the next “hidden” layer. Each such hidden layer receives an input array from its immediately preceding layer and obtains an intermediate output that is supplied to the next layer. After several layers of processing, the final layer (output layer) produces the output of the DNN. The DNN’s weights, biases and other internal parameters can be iteratively optimized by means of a suitable learning algorithm. Classification and regression are the two broad categories of supervised learning. In classification, the DNN produces discretized outputs, while the outputs are continuous quantities in regression tasks. Since this research involves regression, the outputs of the proposed DNN are real quantities between 1 and 10 representing estimated fibrosity scores of input images. These images are obtained from plant-based meat products with varying textural attributes.
Recent years have witnessed an explosive growth in the popularity of DNNs. DNNs have been highly successful in a wide variety of applications, such as home automation [
15], agriculture [
16], large language models [
17], cybersecurity [
18], automated ground drones [
19], automated traffic networks [
20], defense [
21], blockchains [
22], and robotic control [
23].
DNNs have been applied to various food-related image tasks, such as evaluating fish quality [
24,
25], predicting the soluble solid content of sweet potato [
26], classifying tea leaf samples [
27], classifying rice samples [
28,
29], and detecting cracks in wheat kernel images [
29]. A significant amount of research attention is directed at gleaning coherent explanations from DNNs which are usually treated as black box models [
30,
31].
A residual network (ResNet) is a class of DNNs that was first proposed for image classification [
32]. A unique feature of ResNets is the presence of residual connections. This feature allows a hidden layer to deliver its output simultaneously to two downstream layers. In general, ResNets incorporate several hidden layers with residual connections.
ResNets with 18, 34, 50, 101, and 152 layers were investigated in [32], where they consistently outperformed comparable plain (non-residual) DNNs; residual variants with over 1000 layers have also been explored. Theoretical treatments provide further insight into the superior performance of ResNets [33, 34]. They have been successfully used in a wide variety of image processing applications [
33,
35]. ResNet architectures have also been adopted for food-related image processing. ResNets with 18 layers (ResNet-18) have been considered for such applications [
29,
36,
37,
38]. ResNets with 18 as well as 34 layers have been explored in [
39]. Larger ResNets with 50 layers have been proposed elsewhere [
40,
41]. A ResNet with 101 layers has been studied [
42]. In all these cases, ResNets were used for classification tasks. However, in a recently published article [
43], the output of an open-source ResNet-18 was subjected to further statistical treatment, including regression analysis.
In this research, a ResNet-18 model that had been pre-trained for image classification [
44] was suitably modified to perform regression. A new layer was incorporated into the DNN, while its input layer was enlarged to handle larger images. Using data collected for this investigation, a few layers of the original ResNet-18 were retrained for regression. This technique, called transfer learning, is used to curtail the needed training time [
28,
40]. Recent research reports the use of transfer learning for a similar application [
45].
The next section describes in greater detail the data collection methodology; the image preprocessing, augmentation, and human scoring of real images; the generation of synthetic images; and the layout and re-training of the 19-layer ResNet model that was developed as part of this research.
2. Methods
2.1. Generation of TVP Products
Three fava bean concentrate-based (45%) formulations, each containing soy protein concentrate (11%) but different sources of complementary plant proteins (44%), viz., pea protein isolate, soy protein isolate, or wheat gluten, were extruded under different processing conditions to generate TVP products with varying fibrous microstructures. The protein contents of fava bean concentrate (Ingredion, Westchester, IL, USA) and soy protein concentrate (ADM, Quincy, IL, USA) were 60% and 72%, respectively. In pea protein isolate (Puris, Minneapolis, MN, USA), soy protein isolate (ADM, Quincy, IL, USA), and wheat gluten (Royal Ingredients Group, Alkmaar, The Netherlands), the protein contents were 80%, 90%, and 82%, respectively. Thus, the net protein contents of the three formulations were in the range of 70.1–74.5%. The selection and combination of ingredients were informed by prior work [
1], emphasizing the role of protein type and ratio in controlling the structural properties of the extruded plant-based meat products.
The three formulations were processed using a pilot-scale co-rotating twin-screw extruder (TX-52, Wenger Manufacturing, Sabetha, KS, USA), with a 52 mm screw diameter and a length-to-diameter (L/D) ratio of 19.5. The extruder comprised four barrel zones, with temperatures set to 30 °C, 50 °C, 80 °C, and 110 °C from the feed section to the die end. A constant feed rate of 50 kg/h was maintained for all treatments, and the screw speed was fixed at 450 rpm. An aggressive screw configuration was selected, including cut flight, reverse, and kneading block elements, to achieve the high shear and mechanical energy input necessary for protein texturization [
1].
A venturi die of thickness ¼ inch was used to enhance shearing before the material flowed through dual ¼ inch outlet dies. The extrudate was then cut into pieces using a rotary knife system with three blades. The cut extruded products were conveyed to a dual-pass dryer (Series 4800, Wenger Manufacturing, Sabetha, KS, USA) and dried at 113 °C for 14 min, followed by 5 min of ambient air cooling. Each formulation was processed under different extrusion in-barrel moisture content conditions (ranging from 29.2 to 40.9% wet basis), resulting in six distinct plant-based meat extrusion treatments and corresponding products. Products were collected from the dryer at various times during processing, and 63 TVP pieces, spread over the six treatments, were selected from them.
These samples were thereafter utilized for image analysis to investigate the accuracy of the proposed DNN in estimating the samples’ fibrosity scores.
2.2. Data Acquisition: Real Images
To analyze the internal structure of the extruded textured vegetable protein (TVP) products, high-resolution macro images were captured using a Nikon D750 digital camera equipped with a 105 mm macro lens and SB-R200 wireless remote flash (Tokyo, Japan). The imaging setup included a Kaiser Copy Stand RS1, two Dracast Camlux Pro LED light panels, and an 18% grey card as the background to ensure standardized lighting and color balance. Image acquisition was conducted using CaptureOne software (version 10.1.1.5, Phase One, Copenhagen, Denmark).
Prior to imaging, dried TVP samples were rehydrated in tap water for 30 min and then drained for five minutes. Out of the 63 TVP samples that were collected, 18 hydrated pieces were sliced both longitudinally and transversely (relative to the direction of extrusion) in order to expose internal structural features. This procedure, which allowed for the visual inspection of cross-linking and layering densities in different directions, was used to capture 36 images for subsequent analysis. The remaining 45 TVP samples were horizontally sliced, thus providing 45 additional time-series image samples. A total of 81 raw images were acquired in this manner.
Each raw image was in the form of a three-dimensional array of 32-bit unsigned pixels, with the raw image sizes varying across samples. Additionally, a set of synthetic images was created using image software, as described in Section 2.5. The real images were subject to further treatment, as outlined next.
2.3. Data Preparation: Real Images
In order to isolate the ‘figure’ (i.e., the portion showing the food matrix) from the ‘background’ in each of the 81 raw images, a suitable threshold was applied in a pixelwise manner, and pixels below it were recolored black in order to remove background clutter and isolate the relevant ‘figure’ portion. The raw images were then zero-padded so that they were square shaped, with identical horizontal and vertical sizes. The relevant ‘figure’ of each image was translated along the x and y axes so that its centroid coincided with the image’s mid-point. This preprocessing step ensured that all 81 images were properly aligned. The preprocessed images’ horizontal and vertical sizes, which were ~25% that of the largest raw image, were small enough to serve as DNN inputs while also retaining all textural features. For comparison, in another application also involving plant-based meat analogs [45], the input images to a ResNet-18 were an order of magnitude smaller than the present ones.
Figure 1 shows two examples of raw images (top row) along with the corresponding preprocessed images (bottom row).
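The preprocessing steps above (background thresholding, zero-padding to a square, and centroid centering) can be sketched in NumPy as follows. The threshold value, output size, and function name are illustrative assumptions, not the values used in this study.

```python
import numpy as np

def preprocess(raw, threshold=40, out_size=512):
    """Sketch of the preprocessing: background removal, zero-padding to a
    square, and centroid centering. Parameter values are illustrative."""
    img = raw.copy()
    # Recolor below-threshold pixels black to remove background clutter.
    mask = img.mean(axis=2) < threshold
    img[mask] = 0
    # Zero-pad to a square with identical horizontal and vertical sizes.
    h, w, c = img.shape
    side = max(h, w, out_size)
    padded = np.zeros((side, side, c), dtype=img.dtype)
    padded[:h, :w] = img
    # Translate so the 'figure' centroid coincides with the image mid-point.
    ys, xs = np.nonzero(padded.mean(axis=2) > 0)
    if len(ys):
        dy = side // 2 - int(ys.mean())
        dx = side // 2 - int(xs.mean())
        padded = np.roll(padded, (dy, dx), axis=(0, 1))
    return padded
```

The circular shift via `np.roll` is one simple way to realize the translation; any scheme that recenters the figure without clipping it would serve equally well.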
Since the number of samples was relatively sparse, spatial data augmentation was carried out [
46] before training the DNN. Similar spatial operations are routinely used for traditional image augmentation [
47]. Image augmentation methods have also been applied in food processing [
48]. In this research, each processed image was subject to reflection (i.e., mirror imaging) as well as rotations of 0°, 90°, 180°, and 270°. These spatial operations generated 8 re-oriented samples from each processed image, resulting in a total of 648 input samples.
Figure 2 shows all eight spatial orientations of an image. It can be observed in the figure that the centroids of all eight images are either in perfect alignment or differ by only a few pixels.
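The augmentation just described, a reflection combined with rotations of 0°, 90°, 180°, and 270°, can be reproduced with NumPy array operations; the function name here is our own.

```python
import numpy as np

def eight_orientations(img):
    """Return the 8 spatial augmentations: the image and its mirror
    reflection, each rotated by 0, 90, 180, and 270 degrees."""
    variants = []
    for base in (img, np.fliplr(img)):
        for k in range(4):
            variants.append(np.rot90(base, k))
    return variants
```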
2.4. Human Scoring: Real Images
The images were assessed for quality by two human experts, both with substantial academic research experience in plant-based meat production. Each subject provided a score for each image, on a scale of 1 through 10, with a higher score indicating more fibrosity. In order to account for discrepancies in human judgment, scores were obtained through multiple sessions that were scheduled on different dates. A total of six sessions were conducted (two with one subject, four with the other).
A MATLAB program was developed for this purpose. During each session, the program displayed all 81 images on screen for a subject, sequentially but in random order. Furthermore, for each image, only one of its 8 possible orientations was picked, randomly and without repetition across sessions. The subject entered a score between 1 and 10 through the keyboard. The mean score of each image was then obtained separately for each subject, by averaging that subject’s session scores for the image.
For each image, the set of individual session scores, as well as their mean score, were stored as the first three fields of the two subjects’ datasets.
Due to the limited number of sessions per subject, not all image-orientation pairs could be manually scored during the interactive sessions. All such pairs were assigned scores randomly from the corresponding scored pairs. Moreover, preliminary simulations indicated that dissociating the orientations of the images from their manual scores imparted robustness to the trained DNN. Specifically, for each image and each orientation, a score was drawn randomly and without replacement from that image’s existing scores, with the session index treated as a uniformly distributed random number. Accordingly, each re-oriented image was assigned an individual human score, and these sets of image-score pairs were included as the fourth and final fields in the two subjects’ datasets.
In this manner, two complete sets of data, one per subject, were obtained. As a reference for subsequent sections, the generic format of these datasets is given in Equation (2), in which a superscript refers to a subject, one subscript is a session index, and another subscript denotes an orientation. The redundancy in Equation (2) is intended for clarity and does not reflect the true format of the data that was stored in computer memory.
2.5. Data Preparation: Synthetic Images
The immediate purpose of synthesizing additional images was to ensure that the trained DNN was free of inductive bias [
49], i.e., that its output estimates were independent of any extraneous features in the real image samples. Inductive bias in DNNs, whereby they learn to pick up artificial cues from their training datasets, has long been identified as a problem in supervised learning tasks [
49,
50,
51]. Although bias in homogeneous DNNs has been extensively studied (Vardi, 2023), it is not well understood in the context of heterogeneous DNNs, including ResNets [
52].
More broadly, synthetic images would allow the DNN’s output estimation to be more interpretable (explainable). Explainable AI is a topic of significant interest [
31,
36,
53]. Explainable AI methods have been explored in image processing [
54,
55,
56].
To ensure that the DNN was not sensitive to irrelevant image features, and to render its estimation more interpretable, a total of 30 synthetic images were created. Each image was assigned a unique index number between 1 and 30. Based on their shapes, the synthetic images fell under the following four categories: (i) “large circle” (LC), (ii) “box” (BO), (iii) “ellipse” (EL), and (iv) “small circle” (SC).
Figure 3 shows all 30 synthetic images. The relevant ‘figure’ region of each image that represented the food matrix was colored orange so that it resembled the analogous portion of a real image. The smaller, darker objects of different shapes and sizes within the ‘figure’ represent air cells of a real image counterpart. The white rectangular box appearing at the top left of each image in
Figure 3 shows the image number (between 1 and 30). Below it and in the same box is the synthetic image’s estimated granularity score (described later). It should be noted that the images that were used as inputs to the DNN did not contain these boxes. Row-1 (top row) of
Figure 3 contains LC images, 19, 1, 20, 8, 21, 7, 3, 2. Row-2 contains BO images, 15, 14, 12, 22, 13, 9, 11, 10. Row-3 shows EL images, 27, 30, 29, 26, 28, 23, 24, 25. Row-4 (bottom row) has SC images, 18, 17, 5, 6, 4, 16. The images in each row are arranged in decreasing order of their granularity scores, from best (left) to worst (right).
Each synthetic image was subject to reflection and rotations at intervals of 22.5°, thereby providing 32 orientations per synthetic image. This was done to obtain a statistically large number of samples from each synthetic image. Accordingly, a total of 960 synthetic images were available for further investigation.
2.6. Deep Neural Network
This section describes the main aspects of the enhanced ResNet used in this research. The DNN’s input is a color image. Although the pixels of raw images are non-negative integers, they are subject to rescaling and shifts internally in the DNN, an issue that is not addressed here. The output of the last layer is a scalar representing the estimated fibrosity score of the input image, with the corresponding true value being the human-assigned score.
The following passages provide brief descriptions of the layer types and functions.
2.6.1. Convolution Layer
Convolution is very commonly used in digital signal processing as well as in classical image processing. In image processing, it is applied for various spatial operations, such as edge detection, contrast enhancement, and noise removal [
57]. A two-dimensional convolution on an array input X, using a filter W, yields an output array Y, where g is the filter size (g is an odd number) and s is the stride.
For simplicity, let us assume that the horizontal and vertical sizes of array X are multiples of s, and ignore boundary-level image readjustments. Convolution is carried out according to the following expression:

Y(i, j) = Σ_{u=1..g} Σ_{v=1..g} W(u, v) X(s(i - 1) + u, s(j - 1) + v).

Array indices in the above expression, which ignores boundary adjustments, lie within their valid ranges. It can be seen that convolution reduces the input’s horizontal and vertical sizes by the same factor s. The symbol ‘*’ is used to denote the convolution operator, so that the above relationship can be expressed concisely as Y = W * X.
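Ignoring boundary adjustments, the strided two-dimensional convolution described above can be sketched as follows; the symbols x, w, g, and s stand in for the input array, filter, filter size, and stride, and the function name is our own.

```python
import numpy as np

def conv2d(x, w, s=1):
    """Plain strided 2-D convolution (no padding): each output pixel is
    the weighted sum of a g-by-g window of the input."""
    g = w.shape[0]                  # filter size (assumed square and odd)
    H = (x.shape[0] - g) // s + 1   # output height
    W = (x.shape[1] - g) // s + 1   # output width
    y = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            y[i, j] = np.sum(w * x[i*s:i*s+g, j*s:j*s+g])
    return y
```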
Processing in a convolution layer (Conv) takes place concurrently across multiple input and output channels. All channels have their own, equally sized filters and identical strides. Let c be the index of an input channel, and d, that of an output channel [58, 59]. The convolved array Y_d of output channel d is the summation over all input channels of the arrays obtained by convolving each input X_c with the corresponding filter W_{c,d}, as below:

Y_d = Σ_c W_{c,d} * X_c.

Since the input to the proposed DNN is a color image, each color (red, green, blue) may be regarded as an input channel of the first convolution layer, i.e., there are three input channels. Downstream image processing layers have significantly more input and output channels. DNNs with multiple convolution layers are routinely used in various food processing applications [
25,
26,
27,
60].
A thresholding operation is applied in order to ensure that the scalar elements of the layer’s array output are non-negative: every negative element of the convolved array Y_d is replaced with zero. Thresholding is implemented by means of a ReLU (Rectified Linear Unit) layer [57]. Accordingly, the sequence of operations to obtain any channel output from its input requires that the convolution layer be followed by a ReLU layer. However, it is sufficient for our purpose to assume that thresholding takes place internally within the convolution layer itself, a simpler convention that is frequently adopted in the published DNN literature.
Convolution layers are followed by a pooling layer. The two most commonly used pooling operations are max-pooling and average-pooling. The proposed DNN contains layers for both these types of pooling.
2.6.2. Max-Pooling Layer
Pooling is necessary to lower downstream processing (and training) requirements to computationally tractable levels [
61].
The max-pooling layer (MAXPOOL) has the same number of input channels and output channels. The output array Y_c of each channel c is obtained by taking elementwise maxima over g × g windows of its input X_c, in the manner shown below, with window overlaps and/or unused boundary pixels determined by the stride s:

Y_c(i, j) = max_{1 ≤ u, v ≤ g} X_c(s(i - 1) + u, s(j - 1) + v).
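A minimal sketch of the max-pooling operation, with window size g and stride s as above; unused boundary pixels are simply ignored.

```python
import numpy as np

def maxpool2d(x, g=2, s=2):
    """Elementwise maxima over g-by-g windows taken with stride s;
    trailing pixels that do not fill a full window are ignored."""
    H = (x.shape[0] - g) // s + 1
    W = (x.shape[1] - g) // s + 1
    y = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            y[i, j] = x[i*s:i*s+g, j*s:j*s+g].max()
    return y
```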
2.6.3. Average-Pooling Layer
Average-pooling replaces the elementwise maxima with averages. An average-pooling layer (AVGPOOL) has multiple input channels, each carrying a two-dimensional input. However, the layer’s output is a one-dimensional vector that is delivered to a downstream fully connected layer for further processing. For this reason, instead of outlining a generic average-pooling layer, we focus specifically on the layer incorporated in the proposed DNN: if X_c is the input array of channel c, the c-th element of the layer’s output vector is the average of all elements of X_c.
The proposed DNN contains only a single average-pooling layer as the last image processing stage. Subsequent layers involve high-level vector processing.
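Assuming the standard ResNet-style global average over each channel, the average-pooling layer's vector output can be sketched as:

```python
import numpy as np

def global_avgpool(channels):
    """Collapse each channel's 2-D array to its mean, producing the
    one-dimensional vector handed to the first fully connected layer."""
    return np.array([c.mean() for c in channels])
```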
2.6.4. Fully Connected Layer
The input to a fully connected layer (FC) is in the form of a one-dimensional array. Its output can be either another array or a scalar. Let x and y be the input and output vectors of a fully connected layer; a scalar output can be perceived as the specific case where y contains a single element. The parameters associated with a fully connected layer are a bias vector b and a weight matrix W.
An activation vector a is computed internally from x, W, and b. The FC layer’s output y is determined by applying a (piecewise) continuous, monotonic, and bounded nonlinear function to a [62, 63]. In the proposed DNN, the output is obtained by imposing a lower threshold on a by means of elementwise ReLU operations.
More specifically, if a_i is the i-th scalar element of a, and w_i the i-th column of W, the activation is obtained in the following manner:

a_i = w_i^T x + b_i.

The scalar activation in the RHS of the above expression is thresholded by means of a ReLU nonlinearity, whence the i-th scalar output is given as

y_i = max(a_i, 0).
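The fully connected layer with ReLU thresholding can be sketched as below; whether the weight matrix multiplies the input by rows or by columns is a convention, and the matrix orientation used here is an assumption.

```python
import numpy as np

def fc_relu(x, W, b):
    """Fully connected layer: activation a = W x + b, followed by an
    elementwise ReLU threshold (row-vector convention is assumed)."""
    a = W @ x + b
    return np.maximum(a, 0.0)
```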
2.6.5. Residual Connection
The key feature of a ResNet is the presence of residual connections, which operate on two different array inputs. One input is the output of the immediately preceding layer. The other input is the output of an earlier, upstream layer. For instance, if there is a residual connection before layer n, the two input arrays may be the outputs of layers n - 1 and n - k (k > 1). In this case, we say that the output from the latter “skips” k - 1 layers.
Let Y denote the output of the immediately preceding layer and X the skipped input. The residual connection’s output Z is of the same size as Y, whereas X, which skips some layers, may be an array of larger size. If the size of X exceeds that of Y and Z, it is subject to down-sampling. In the existing literature on ResNet DNNs, down-sampling is invariably referred to as 1 × 1 convolution [32], although it does not involve any associated filter.
Down-sampling is applied, when needed, to reduce the size of the skipped input X by a factor r, i.e., the stride. This is accomplished by taking regularly spaced samples of X at each channel c, to yield another array X̃ whose size matches that of Y:

X̃_c(i, j) = X_c(r(i - 1) + 1, r(j - 1) + 1).  (8)

The output Z is obtained by adding Y and X̃ pixelwise:

Z = Y + X̃.  (9)

When both inputs to the residual connection are equally sized, no down-sampling is needed. This can be viewed as down-sampling with r = 1, so that

X̃ = X when r = 1.  (10)
Residual connections significantly reduce the total number of layers needed by the DNN. If X is the skipped input, then Y is computed by subjecting X to several layers of processing, so that Y = F(X), where the map F entails some form of nonlinear image processing. To see the usefulness of a residual connection, assume that F amounts to the negative of a spatial blurring operation [32, 34]. Adding the skipped input X to Y, the result Z = X + Y is the difference between the image and its blurred version, i.e., an edge image. In other words, this residual connection serves as an edge detector. Deeper layers in the DNN that are involved in image processing can readily extract edge-related information, obviating the need for multiple other image processing layers. This is why the number of layers needed by a ResNet is lower than that of a classical DNN for a comparable task. Fortuitously, this reduction also decreases the overall computational time required to train the ResNet.
In spite of regarding down-sampling as convolution, the published research on ResNet architectures typically does not treat residual connections as separate layers—a convention that is adopted throughout this article.
2.6.6. DNN Layout
Ignoring boundary processing, a convolution layer (Conv) can be fully characterized in terms of the number of output channels, the filter size, and the stride. This is also the case with a max-pooling layer (MaxPool), where the filter size is now interpreted as a window size. So long as the size of the input image is known, a Conv or MaxPool layer’s output size can be readily obtained from these constants. An average-pooling layer (AVGPOOL) is completely specified in terms of the size of its output vector. In a similar manner, the output size alone suffices to describe the layout of any fully connected layer (FC). The only determinant of down-sampling is the stride.
Figure 4 illustrates the architecture of the modified ResNet that was developed for this research. Layers are represented as colored rectangles. The parametric constants of each Conv and MAXPOOL layer are provided inside the rectangles, in a format that is consistent with published research. Each layer’s output size is shown below it.
The DNN’s input image undergoes several layers of image processing. After the initial Conv layer and MAXPOOL layer, downstream image processing layers are grouped into four blocks, each comprising four Conv layers with identical output sizes. The layers in each block are shown as rectangles of the same color. All connections are shown as red arrows. The strides of residual connections (Equation (8)) are enclosed within small blue squares. The pixelwise additions involved in Equations (9) and (10) are depicted as blue dots.
Two FC layers follow the final image processing AVGPOOL layer. They are the only trainable layers in the DNN. The second FC layer, which contains a single neuronal unit, determines the overall DNN output, which is the estimated quality score.
2.6.7. DNN Training
Only the two FC layers of the DNN were trained. Samples were drawn randomly from a given dataset and divided in the standard ratio of 85:15 into two disjoint subsets: a training set and a test set.
Referring to Equation (2), the last field of the dataset, which consisted of image-score pairs, was used to train the DNN. An image was drawn at random to serve as the input to the DNN, and its output was the corresponding estimated score. The purpose of training was to adjust the FC weights and biases until the estimates were as close as possible to the real subjects’ scores. The sum squared error loss shown below was used for minimization:

E = Σ_k (q_k - q̂_k)²,  (13)

where q_k is the subject’s score for the k-th training sample and q̂_k is the DNN’s corresponding estimate.
Sum squared error loss functions are routinely used in training algorithms for regression [
19]. Current DNN training algorithms add regularization terms to the loss [
64].
Details of the training algorithm are not provided here, as they are standardized aspects that are built-in within Pytorch [
44] and the Torchvision package [
It suffices for our purpose to merely mention that a form of stochastic gradient descent was applied to minimize the loss in Equation (13). An epoch is a single pass through all training samples. The weights of the FC layers were updated incrementally through several epochs, with an up-to-date learning method based on the stochastic gradient descent rule [66],

w ← w - η ∂E/∂w,  (14)

where w is any trainable weight and η is the learning rate. Although the learning rate in the above is depicted as a constant, in reality it varies across layers and is progressively reduced with the training epoch.
State-of-the-art DNN learning algorithms offer several improvements over classical stochastic gradient descent, such as batch normalization, dropout, and other schemes. For further details, the reader is referred elsewhere [14]. These features are an integral part of the Pytorch software, whose code internally sets aside a proportion of the training samples for validation. Suitable features were selected during the DNN training.
The DNN was trained using the ADAM optimizer [
44,
65]. During training, dropout layers were added to the FC layers. As dropout layers were not required beyond the training stage, they are not shown in
Figure 4. Regularization techniques were employed to improve generalization and prevent overfitting.
The significant training parameters were as follows. The initial learning rate was fixed. The weight decay (L2 regularization) was set to 0.001. The dropout rate after the FC layers was set to 0.5. Additionally, a learning rate scheduler (ReduceLROnPlateau) was applied to reduce the learning rate by a factor of 0.5 whenever the validation loss did not decrease for three consecutive epochs. Early stopping was implemented with a patience of 20 epochs, ensuring that training halted once the performance began to plateau. The DNN was trained for up to 5000 epochs, with batch sizes of 8 and 32 for the training and validation datasets, respectively. Other secondary aspects of DNN training did not play any significant role and therefore have not been addressed in this article.
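The scheduling logic described above (halve the learning rate after three stagnant validation epochs, stop after twenty) can be sketched in plain Python; the initial learning rate here is an arbitrary placeholder, since the study's value is not reproduced in this excerpt.

```python
def schedule(val_losses, lr0=1e-3, factor=0.5, lr_patience=3, stop_patience=20):
    """Sketch of ReduceLROnPlateau-style halving plus early stopping,
    driven by a sequence of per-epoch validation losses."""
    lr, best = lr0, float("inf")
    since_best, since_drop = 0, 0
    history = []
    for loss in val_losses:
        if loss < best:            # validation loss improved
            best = loss
            since_best, since_drop = 0, 0
        else:                      # no improvement this epoch
            since_best += 1
            since_drop += 1
        if since_drop >= lr_patience:
            lr *= factor           # halve the learning rate
            since_drop = 0
        history.append(lr)
        if since_best >= stop_patience:
            break                  # early stopping
    return history
```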
2.7. Statistical Metrics
In accordance with prior research [
19], the statistical metrics that were adopted in this research fall under three categories: (i) error norm metrics, (ii) goodness-of-fit metrics, and (iii) linear regression metrics. The goodness-of-fit metrics use score means whose underlying expressions are as given below:

μ = (1/N) Σ_k q_k,  μ̂ = (1/N) Σ_k q̂_k,  (15)

where q_k and q̂_k are the true and estimated scores of the k-th sample, and N is the number of samples.
Depending on the dataset, the true score may refer either to one of the two subjects’ scores or to their weighted mean. Brief descriptions of each category follow.
(i) Error Norm: The two norm-based errors used in this research are the mean squared error (MSE) and the averaged absolute error (MAE). They are normalizations of the squared L2 norm (Euclidean distance) and the L1 norm (Manhattan distance), respectively. The errors are defined as below:

MSE = (1/N) Σ_k (q_k - q̂_k)²,  MAE = (1/N) Σ_k |q_k - q̂_k|.

In the ideal case, i.e., when the estimates are accurate, MSE = 0 and MAE = 0.
(ii) Goodness-of-Fit: The coefficients of determination, R², and correlation, r, are as shown in the expressions below:

R² = 1 - Σ_k (q_k - q̂_k)² / Σ_k (q_k - μ)²,
r = Σ_k (q_k - μ)(q̂_k - μ̂) / ( √(Σ_k (q_k - μ)²) √(Σ_k (q̂_k - μ̂)²) ).

The quantities μ and μ̂ in the RHS of the above expressions are obtained from Equation (15). The ideal values of the coefficients are R² = 1 and r = 1.
(
iii) Linear Regression: Linear regression is applied with the y-intercept constrained to zero to obtain a straight line of slope
passing through the origin. It is also applied without this constraint to obtain a line with slope
and y-intercept
. Mathematically,
The best outcome is when the slopes are and , and the y-intercept is .
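The metrics defined in this subsection can be computed directly from their definitions. The following pure-Python sketch (function and variable names are illustrative, not taken from the paper) mirrors the standard expressions for the error norms, goodness-of-fit coefficients, and regression parameters, with `s` as the reference scores and `sh` as the DNN estimates:

```python
import math

def metrics(s, sh):
    """Compute error norm, goodness-of-fit, and linear regression metrics
    for reference scores s and estimates sh (equal-length sequences)."""
    n = len(s)
    mean_s = sum(s) / n
    mean_sh = sum(sh) / n
    mse = sum((a - b) ** 2 for a, b in zip(s, sh)) / n      # mean squared error
    mae = sum(abs(a - b) for a, b in zip(s, sh)) / n        # averaged absolute error
    ss_res = sum((a - b) ** 2 for a, b in zip(s, sh))       # residual sum of squares
    ss_tot = sum((a - mean_s) ** 2 for a in s)              # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                              # coefficient of determination
    cov = sum((a - mean_s) * (b - mean_sh) for a, b in zip(s, sh))
    rho = cov / math.sqrt(ss_tot * sum((b - mean_sh) ** 2 for b in sh))  # correlation
    m0 = sum(a * b for a, b in zip(s, sh)) / sum(a * a for a in s)  # slope, through origin
    m = cov / ss_tot                                        # unconstrained slope
    b0 = mean_sh - m * mean_s                               # y-intercept
    return {"mse": mse, "mae": mae, "r2": r2, "rho": rho, "m0": m0, "m": m, "b": b0}
```

When the estimates are exact, this returns the ideal values described above: zero errors, unit coefficients and slopes, and a zero intercept.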
4. Discussion
The salient contributions of this research are threefold, as outlined in the following paragraphs.
Firstly, it was demonstrated that a DNN can be successfully applied to estimate granularities from input images in the manner of human experts. This was evidenced from the results with real images in
Section 3. In spite of limited image samples and prior human scores, the DNN could be trained for this purpose, whose accuracy is reflected through multiple statistical performance metrics. This task was accomplished using a suitable ResNet-18 layout with an additional layer, combined with appropriate spatial image preprocessing, data enhancement, and transfer learning. Although ResNets are routinely used for similar applications, to the best of the authors’ knowledge, the DNN in this research is the first to be developed for the regression task of estimating fibrosities of meat analogs.
Next, a close examination of the differences between the subjects' scoring patterns, together with the DNN's significantly better performance when trained separately on each subject's scores, suggests that the DNN could integrate subtler aspects of human scoring into its estimation parameters. Although this outcome is not conclusive, the authors believe it would be worthwhile to extend the study by collecting and statistically analyzing more subject scores, examining how they correlate with various image properties, and accounting for the extent of implicit perceptual bias.
Lastly, the outcome of the experiment with synthetic images is noteworthy. In the authors' view, the DNN's estimated fibrosity scores followed a remarkably consistent pattern that was amenable to a simple, straightforward interpretation in terms of features of the input images. The study strongly suggests that the DNN's estimation scheme was based on the extent to which the air cells cover the food matrix, their number, and their elongation.
Needless to say, this research is not without limitations. Although it highlights the feasibility of using such DNNs to assess the fibrosities of extruded plant-based meat products from camera images without human intervention, all real images used here were obtained solely by the present team. Collecting subject scores from a larger group of human experts would have enabled a more in-depth analysis of human assessment. The DNN's estimates were interpreted through visual observations; quantifying the matrix and cell properties in the synthetic images would have allowed a more mathematically rigorous interpretation.
5. Conclusions
This research demonstrates the effectiveness of the proposed DNN, an extension of ResNet-18, in estimating the fibrosities of plant-based meat analogs from camera images. It was shown that, with only a relatively limited amount of data and appropriate augmentation, the DNN could be trained to provide estimates with a high degree of accuracy. Results with real images illustrate that the DNN was capable of incorporating perceptual elements present in human assessments of plant-based meat quality.
Human scores were used only for the DNN’s training and evaluation; considering the possibility that some deeper aspects of human assessment may be dauntingly complex for this research [
67], their underlying perceptual basis remains outside the scope of this study. This is unlike the approach taken in [
10], where computer vision algorithms were applied to obtain a set of prespecified textural attributes, which were correlated with human visual inspection. Instead of selecting
a priori only some features for investigation, a holistic approach has been adopted here. Only limited fine-tuning with additional data is needed to customize a DNN for other plant-based meat analogs, as well as for other desired textural features. Traditional computer vision approaches do not offer this kind of flexibility.
Analysis of the DNN's scores with synthetic image inputs illustrates that an inordinate amount of experimental data is not needed to achieve high estimation accuracy. This can be accomplished by selecting a suitable layout (e.g., the extended ResNet layout proposed here) along with appropriate data preprocessing, augmentation, and transfer learning steps. Furthermore, interpretation of this experiment's outcome suggests that the proposed scheme endowed the DNN with the ability to discern intrinsic perceptual differences among human experts while remaining free of bias [
52].
Future research can be pursued along several directions. Fibrosities of plant-based meat products are influenced by multiple spatial elements present in their food matrices and air cells. All these features can be quantified using suitable image segmentation and labeling algorithms [
68]. While these features can be integrated into a single, empirical fibrosity measure per image sample, the authors plan to explore Pareto optimality—a concept widely used in multicriteria decision-making research [
69]—as an alternative criterion to assess fibrosities. Active learning can be investigated for continual re-adaptation under changing external conditions [
70].
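As a minimal sketch of the kind of segmentation and labeling referred to above, the following illustrative function (not part of the paper's pipeline) labels air cells in a binarized image via 4-connected flood fill and reports the three spatial features discussed earlier: matrix coverage, cell count, and per-cell elongation, the latter approximated here by the bounding-box aspect ratio. A real pipeline would typically use a dedicated library such as scikit-image's `label`/`regionprops`:

```python
from collections import deque

def cell_features(grid):
    """Label air cells (1s) in a binary image grid (list of 0/1 rows) and
    return (coverage fraction, cell count, list of per-cell elongations)."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    elongations = []
    for y in range(h):
        for x in range(w):
            if grid[y][x] and not seen[y][x]:
                # Flood-fill one connected component (one air cell).
                q = deque([(y, x)])
                seen[y][x] = True
                ys, xs = [], []
                while q:
                    cy, cx = q.popleft()
                    ys.append(cy)
                    xs.append(cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and grid[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                # Elongation: long side over short side of the bounding box.
                bh = max(ys) - min(ys) + 1
                bw = max(xs) - min(xs) + 1
                elongations.append(max(bh, bw) / min(bh, bw))
    coverage = sum(map(sum, grid)) / (h * w)
    return coverage, len(elongations), elongations
```

On a small test grid containing one horizontal 1×2 cell and one vertical 3×1 cell, this yields a coverage of 5/16, a count of 2, and elongations of 2.0 and 3.0.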
Computational tools are available for the purpose of DNN explainability [
54,
55]. Many of these methods are model-agnostic, i.e., they treat machine learning models as black boxes. However, other methods, which are specific to DNNs, are also available [
31,
71]. They are used to impart explainable elements while training the DNN. A suitable method can be adopted to train a ResNet with the proposed layout, such that its input–output mapping would be more tractable for explainability analysis.
Lastly, research should also be aimed towards the fully automated optimization of extrusion process control parameters. This goal would require the use of reinforcement learning. Up-to-date deep reinforcement learning models, which are equipped with one or more DNNs, have met with a great deal of success in a wide variety of complex, real-world applications [
15,
20,
23,
72].