1. Introduction
Convolutional neural networks (CNNs) have revolutionized many areas of computer vision, including object recognition, facial detection, and texture classification [1]. One emerging area where CNNs show strong promise is the classification, detection, and identification of textiles, specifically fabrics [2–12], with applications in textile manufacturing, automated quality inspection, and smart material recognition. With growing interest in identifying and classifying fabrics, image-based approaches can provide a non-invasive, low-cost, and scalable solution. However, current implementations overlook the physical image acquisition process, particularly the effect of the focal length and distance at which images are captured. This study is based on the premise that real-world image capture conditions, such as focal length variation, could significantly influence CNN performance by diversifying the level of texture detail captured in fabric images.
Recent studies in fabric classification have largely focused on using CNNs either to classify and identify fabrics or to detect faults, defects, and damage in them. Classification work includes Ref. [3], where the researchers classified different blends of Abaca fabrics, and Refs. [6,7], where pineapple fabrics and pineapple-cotton fabric blends were the main focus of the CNN-based classification systems. Similarly, the identification of Barong Tagalog textiles using CNNs was conducted in Ref. [8]. Other research aimed to improve textile manufacturing processes by developing systems to detect fabric faults and defects: Ref. [4], where fabric anomalies were detected to enhance textile quality assurance; Ref. [5], where detection was applied in a real-world manufacturing setup; Ref. [9], where classification of woven fabric faults was explored; and Ref. [10], where fabric defects were detected using deep learning. Although these works demonstrate the effectiveness of CNNs in fabric classification and identification tasks, they do not address hardware-based image acquisition techniques or methodologies. Most rely heavily on digital zoom and scaling as proxies for image diversity, while treating all images as uniformly representative of the target material.
While such methods have yielded promising results in certain contexts, the results in Ref. [8] indicate limited accuracy, with the system trained to identify Barong Tagalog textiles achieving only 71.10%. This highlights the need for improved image acquisition strategies to enhance model generalization. Moreover, these studies have not examined how optical variations at the point of capture, particularly differences in camera distance and focus, influence classification performance. Evidence from other domains, such as medical imaging and remote sensing, shows that physical variations in image scale and focus significantly affect CNN learning [12,13]. However, similar investigations in fabric recognition remain scarce.
Brown et al. [14] analyzed the impact of camera choice on image classification tasks by evaluating six different cameras used to capture dataset images. Nevertheless, they did not consider the potential effects of lens attachments. Therefore, it is necessary to explore the role of physical capture configurations, such as focal length variation, in augmenting or improving fabric classification performance. While augmentation techniques simulate scale diversity, they cannot fully replicate the optical effects of genuine focal changes, such as depth of field, resolution shifts, and real-world detail gradients [15,16].
2. Methodology
2.1. Experimental Workflow
The research process flowchart presents the structured experimental workflow designed to investigate the influence of varying focal lengths on CNN-based fabric classification performance (Figure 1). The process begins with image acquisition, in which fabric images are captured under three focal length configurations: far, mid, and near.
This is followed by dataset organization, where captured images are grouped according to their focal length category, and dataset preparation, which involves splitting the organized data into a combined training and validation subset and a dedicated testing subset. Two CNN architectures are separately trained on each focal length-based dataset. Afterward, the trained models are evaluated using validation and inference accuracy to determine their classification performance on each dataset variant. Finally, the results are compared to identify performance trends across focal lengths and between the two architectures, allowing a detailed analysis of how focal length variation affects CNN-based fabric classification systems.
2.2. Image Acquisition Device
An image acquisition prototype, identical to that described in Ref. [17] as part of the author’s previous work, was designed and developed to facilitate fabric image collection. The annotated design of the device is shown in Figure 2a. The device measures 17.78 × 17.78 × 33.02 cm (length × width × height).
The camera lens is positioned 8.94 cm above the sample platform, providing the optimal focal distance required for high-detail fabric image capture. Additionally, a 1.27 cm insertion gap is integrated at the base of the device to facilitate easy placement and alignment of fabric samples. Image acquisition is performed using a 4-megapixel high-definition USB camera with a native resolution of 2560 × 1440 to enable detailed capture of fabric textures. Illumination is provided by USB-powered LED strip lights, delivering stable 5 V power directly from the Raspberry Pi without an external supply. For user interaction, a 7-inch touchscreen LCD (800 × 480 resolution) is used, eliminating the need for peripheral input devices. The camera is equipped with a 5–50 mm varifocal lens, which supports adjustable focal lengths to enhance image sharpness and emphasize fine fabric details.
The experimental image acquisition setup was fabricated using 3D printing (Figure 3a). The device accommodates a minimum fabric sample dimension of 5.08 × 5.08 cm, defined by the lens’s maximum zoom capability and field-of-view constraints. For improved handling and alignment, larger samples extending beyond the device width are recommended. Additionally, the prototype enclosure incorporates a front access opening, facilitating precise lens adjustments without requiring system disassembly. A custom-designed GUI for the image acquisition device is presented in Figure 3b. The interface features a capture button alongside progress indicators that display the progress of the current focal length configuration batch and the total number of images captured per class during each data collection session. Each focal length configuration batch is set to capture exactly 100 images, and the system automatically halts further image storage once the target count of 300 images for that class is reached.
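The batch and class limits described above can be illustrated with a small counter. This is only a sketch of the stated rule (100 images per focal length configuration, storage halted at 300 images per class); the class name `CaptureSession`, its methods, and the far-mid-near capture order are hypothetical and are not taken from the device's actual GUI code.

```python
FOCAL_CONFIGS = ("far", "mid", "near")  # capture order is an assumption
IMAGES_PER_CONFIG = 100                 # batch size stated in the text
IMAGES_PER_CLASS = IMAGES_PER_CONFIG * len(FOCAL_CONFIGS)  # 300 per class


class CaptureSession:
    """Tracks capture progress for one fabric class (hypothetical helper)."""

    def __init__(self):
        self.counts = {config: 0 for config in FOCAL_CONFIGS}

    @property
    def total(self):
        """Total number of images stored for this class so far."""
        return sum(self.counts.values())

    def current_config(self):
        """Return the first focal configuration whose batch is incomplete."""
        for config in FOCAL_CONFIGS:
            if self.counts[config] < IMAGES_PER_CONFIG:
                return config
        return None  # every batch is full

    def capture(self):
        """Register one capture; return False once the class target is met."""
        config = self.current_config()
        if config is None:
            return False  # class target reached, nothing more is stored
        self.counts[config] += 1
        return True


session = CaptureSession()
# Simulate 350 presses of the capture button; only 300 should be stored.
stored = sum(session.capture() for _ in range(350))
print(stored, session.counts)
```

Keeping the per-batch counters separate from the class total mirrors the two progress indicators described for the interface.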
2.3. Data Gathering
An image dataset consisting of a total of 1350 fabric images was prepared by capturing random partial views of each fabric input sample.
Figure 4 shows representative images from each class and focal length configuration.
The dataset is split into a combined training and validation subset and a dedicated testing subset. The training and validation subset contains 900 images: each of the three fabric classes has 300 images, evenly split into far, mid, and near views (100 each) according to the applied focal length configuration. The dedicated testing subset contains 450 images: each class has 150 images, split 50:50:50 across the focal length configurations. This subset is used specifically to test the model trained on the combined training and validation subset.
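As a quick consistency check, the subset sizes follow directly from the per-view counts. The short sketch below recomputes them; the constant and function names are illustrative only.

```python
CLASSES = 3                             # three fabric classes
FOCAL_CONFIGS = ("far", "mid", "near")
TRAIN_VAL_PER_VIEW = 100                # images per class and focal view
TEST_PER_VIEW = 50


def subset_sizes(per_view):
    """Return (images per class, images in the whole subset)."""
    per_class = per_view * len(FOCAL_CONFIGS)
    return per_class, per_class * CLASSES


train_val_per_class, train_val_total = subset_sizes(TRAIN_VAL_PER_VIEW)
test_per_class, test_total = subset_sizes(TEST_PER_VIEW)

print(train_val_per_class, train_val_total)  # per class, whole subset
print(test_per_class, test_total)
print(train_val_total + test_total)          # size of the full dataset
```

The totals agree with the reported figures: 900 training and validation images plus 450 testing images give the 1350 images in the dataset.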
2.4. Model Architectures
To evaluate the impact of focal length variations on classification performance, this study employs two widely recognized CNN architectures: MobileNetV2 and ResNet50. These models were selected to represent two distinct architectural philosophies, lightweight networks and residual networks, for a diverse performance comparison under the same experimental conditions. MobileNetV2 is designed for computational efficiency and optimized for deployment on resource-constrained devices. It uses inverted residual blocks and depthwise separable convolutions, which reduce the number of parameters while maintaining competitive accuracy, making it suitable for lightweight yet effective feature extraction [18]. ResNet50 employs residual connections, which allow the network to train effectively even at considerable depth by mitigating the vanishing gradient problem. Its 50-layer architecture is built from bottleneck residual blocks, which improve computational efficiency while enabling the model to learn highly abstract and complex feature hierarchies. This capability is advantageous for texture recognition tasks, where variations in focal length subtly alter spatial detail and feature distribution. By maintaining representational strength across layers, ResNet50 remains robust to such variations, improving the model’s ability to generalize to images captured at different distances and zoom levels [19].
2.5. Model Training
The process flow of the model training is shown in Figure 5. The process begins with the image dataset as input, followed by preparation for augmentation and preprocessing. To enhance model generalization and mimic real-world variability, several augmentation techniques were applied, including random horizontal and vertical flips, minor rotations, shifts in width and height, brightness adjustments, and shear transformations, while excluding zoom variations to align with the study’s objectives. All augmentations employed nearest-neighbor filling to preserve edge integrity. Model reconstruction via transfer learning varied only in the preprocessing configurations specific to each architecture.
For MobileNetV2, the model was initialized with pretrained ImageNet weights, with its depthwise separable convolutional backbone initially frozen to retain robust low-level and mid-level feature extraction capabilities. Input images were resized to 224 × 224 pixels and normalized to the [0, 1] range to align with MobileNetV2’s expected preprocessing. Likewise, ResNet50 was initialized with pretrained ImageNet weights, with its deep residual convolutional layers initially frozen to preserve the pretrained hierarchical feature mappings. Input images were resized to 224 × 224 pixels and processed using the ResNet50-specific input preprocessing function.
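The difference between the two preprocessing paths can be shown on a single pixel. The sketch below is written in plain Python and rests on an assumption: the text does not name the framework, so the ResNet50 path is modeled on the Keras `preprocess_input` behavior for ResNet50 (RGB reordered to BGR, ImageNet channel means subtracted, no scaling). The function names are hypothetical.

```python
# ImageNet channel means in BGR order, as used by Keras' "caffe"-style
# preprocessing (assumed here to be the ResNet50-specific function).
IMAGENET_BGR_MEANS = (103.939, 116.779, 123.68)


def scale_unit_range(rgb):
    """MobileNetV2 path described in the text: map 0-255 values to [0, 1]."""
    return tuple(channel / 255.0 for channel in rgb)


def caffe_style(rgb):
    """ResNet50 path (assumed): RGB -> BGR, then subtract channel means."""
    bgr = rgb[::-1]
    return tuple(value - mean for value, mean in zip(bgr, IMAGENET_BGR_MEANS))


pixel = (255, 128, 0)  # an arbitrary RGB pixel, for illustration only
print(scale_unit_range(pixel))
print(caffe_style(pixel))
```

The first path keeps values in a fixed unit range, while the second produces zero-centered values on the original 0 to 255 scale, which is why the two backbones cannot share one preprocessing step.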
2.6. Cross-Dataset Evaluation
To assess the cross-generalization capability of CNNs trained on fabric images captured at varying focal distances, four distinct datasets were utilized: Far, Mid, Near, and Combined, each containing three fabric classes. For each dataset, the network was trained exclusively on its corresponding training split and validated on its respective validation set to monitor convergence. Following training, the model underwent cross-domain evaluation, where it was tested on the remaining datasets, with no overlap between training and test data, to simulate real-world shifts in image acquisition conditions. This ensured that a model trained on, for example, the Far dataset was never tested on the Far test split, but instead on the Mid, Near, and Combined sets.
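The pairing rule can be written down compactly. The sketch below is one reading of the protocol, not the authors' code: it assumes the test-side counterpart of the Combined dataset is the dedicated testing set, named `CombinedTesting` here as in the results, and that a model is simply never tested on the dataset it was trained on.

```python
FOCAL_SETS = ("Far", "Mid", "Near")
TRAIN_SETS = FOCAL_SETS + ("Combined",)
# Assumption: the dedicated testing subset stands in for "Combined" at test time.
TEST_SETS = FOCAL_SETS + ("CombinedTesting",)


def evaluation_plan():
    """Map each training dataset to the datasets it is tested on."""
    return {
        train: [test for test in TEST_SETS if test != train]
        for train in TRAIN_SETS
    }


for train, tests in evaluation_plan().items():
    print(f"{train:>8} -> {', '.join(tests)}")
```

Under this reading, each single-focal-length model is tested on three sets, while the Combined-trained model is tested on all four, which matches the four accuracies reported for it.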
3. Results and Discussion
MobileNetV2 and ResNet50 were trained and evaluated under varying focal length conditions in the collected image datasets. Each model was trained on one dataset and evaluated on the remaining datasets, enabling cross-domain generalization assessment. The classification performance, measured primarily through accuracy, is presented and compared to identify the effects of focal distance variation and architectural differences on model performance. The Combined dataset is the union of the Far, Mid, and Near datasets, while CombinedTesting is the dedicated testing dataset that represents the combined focal length variations. The models were trained under uniform conditions to ensure fair and unbiased performance evaluation. No target accuracy was set and no fine-tuning was applied to any model, which provides a consistent performance benchmark across architectures. Training was limited to a maximum of 10 epochs, with an early stopping patience of 3 epochs to halt training if no improvement in validation accuracy was observed.
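The stopping rule can be sketched in a few lines. This is a framework-independent illustration of the stated settings (at most 10 epochs, patience of 3 on validation accuracy); the function name and the accuracy values in the example are made up for illustration and are not results from this study.

```python
MAX_EPOCHS = 10
PATIENCE = 3


def epochs_run(val_accuracies, max_epochs=MAX_EPOCHS, patience=PATIENCE):
    """Return how many epochs run before training halts.

    `val_accuracies` holds one validation accuracy per epoch.
    """
    best = float("-inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, accuracy in enumerate(val_accuracies[:max_epochs], start=1):
        if accuracy > best:
            best, stale = accuracy, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch  # early stop
    return min(len(val_accuracies), max_epochs)


# Illustrative history: the best value is reached at epoch 3 and never beaten.
history = [0.80, 0.90, 0.95, 0.95, 0.94, 0.95, 0.99]
print(epochs_run(history))
```

In this example the late jump at epoch 7 is never seen, because three consecutive epochs without improvement have already ended training. That is the usual trade-off of a short patience window.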
Table 1 presents the validation accuracy of the trained models, based on an 80/20 training and validation split of each dataset. All models demonstrated strong performance in classifying fabric images captured under consistent focal length and zoom configurations. However, such conditions rarely hold in practical applications.
More realistic results are obtained under cross-dataset evaluation, as shown in Table 2. MobileNetV2, as a lightweight architecture, demonstrated an advantage when trained on the combined dataset, achieving near-perfect accuracy across all test sets: Far (98.67%), Mid (99.67%), Near (99.33%), and CombinedTesting (96.00%), with an overall average accuracy of 98.42%. In contrast, models trained on single focal length datasets exhibited reduced performance when tested on different focal lengths, indicating limited adaptability. ResNet50 also improved when trained on multiple focal lengths and generally performed well, although the Mid-trained model was noticeably less accurate, averaging 74.22%. These results suggest that incorporating diverse focal lengths in the training set enhances generalization and, in turn, accuracy in real-world fabric classification tasks, specifically when MobileNetV2 and ResNet50 serve as the base CNN architecture. Equation (1) was used to determine the absolute accuracy gain of each model, with the results summarized in Table 3.
The MobileNetV2-based models showed an average absolute accuracy gain of 20.57, while the ResNet50-based models showed 9.78.
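The reported average can be reproduced from the four per-test-set accuracies quoted above. The sketch below also shows one plausible form of the gain calculation. Equation (1) is not reproduced in this section, so the definition used here (combined-trained accuracy minus single-focal-length accuracy, in percentage points) is an assumption, and `absolute_gain` is a hypothetical helper.

```python
from statistics import mean

# Accuracies (%) of the combined-trained MobileNetV2, as quoted in the text.
mobilenet_combined = {
    "Far": 98.67,
    "Mid": 99.67,
    "Near": 99.33,
    "CombinedTesting": 96.00,
}
average = mean(mobilenet_combined.values())
print(round(average, 2))  # compare with the reported 98.42%


def absolute_gain(combined_acc, single_acc):
    """Assumed definition: difference in percentage points."""
    return combined_acc - single_acc


# Illustration only, using two ResNet50 averages quoted in the text:
# combined-trained (96.30%) versus Mid-trained (74.22%).
print(round(absolute_gain(96.30, 74.22), 2))
```

The second figure describes a single training configuration under the assumed definition. It is not the averaged gain reported in Table 3.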
4. Conclusions and Recommendations
Incorporating focal length variation into the dataset acquisition setup significantly influences the overall performance of CNN-based fabric classification. Using a varifocal lens image acquisition technique, datasets were generated at multiple focal distances and evaluated with two CNN architectures, MobileNetV2 and ResNet50. Cross-dataset evaluations revealed that the combined focal length dataset consistently delivered higher results, with MobileNetV2 achieving 98.42% testing accuracy and ResNet50 reaching 96.30%. The improvement was most noticeable in MobileNetV2, which recorded an average absolute accuracy gain of 20.57, while ResNet50 achieved an average absolute accuracy gain of 9.78. These results indicate that integrating focal length-based image diversity into datasets improves model generalization. Future research should extend this approach to other material recognition domains, investigate additional optical and hardware techniques to enrich dataset variability, and evaluate focal length-based datasets across a broader range of CNN architectures. Furthermore, validating the scalability of varifocal lens methods in diverse computer vision tasks and expanding fabric dataset diversity could further improve CNN classification performance.