1. Introduction
Amid growing global emphasis on energy conservation and emission reduction, the maritime industry faces mounting environmental challenges. As the primary regulatory body for international shipping, the International Maritime Organization (IMO) has introduced a series of regulations to promote sustainable transformation. In response, major shipping companies have increasingly adopted Low-Sulfur Fuel Oil (LSFO) [1,2].
Although low-sulfur fuel oil (LSFO) can significantly reduce sulfur oxide emissions, its complex composition, poor stability, and batch-to-batch variability also introduce considerable challenges [3,4]. These characteristics pose new threats to the operational stability and reliability of marine diesel engines, particularly affecting cylinder lubrication systems [5,6]. Under conventional fuels, engine manufacturers usually provide recommended lubrication feed rates based on fuel type and operating conditions, typically ranging from 0.8 to 1.4 g/kWh [7]. However, with the adoption of LSFO, these reference values have become increasingly unsuitable [8,9,10]. Especially under fluctuating sea conditions, adjusting the lubrication feed rate (FR) and base number (BN) has become substantially more complex [11,12,13]. Islam and Martin [14] performed a reliability assessment of marine main propulsion diesel engines, demonstrating that variations in operating conditions and maintenance strategies are critical determinants of engine longevity and operational stability.
Due to the lack of standardized guidelines and model-based support, current lubrication strategies onboard still largely depend on crew members’ subjective judgment and trial-and-error operations. Such experience-driven approaches suffer from poor repeatability, increased maintenance costs, and unstable lubrication performance. An appropriate feed rate is critical to the stable operation of two-stroke marine engines: adequate lubrication enables cooling, cleaning, sealing, and anti-corrosion functions while maintaining a stable oil film to reduce friction and wear. Conversely, over-lubrication may cause excessive deposits and waste, whereas under-lubrication can lead to scuffing, seizure, and even premature failure, as illustrated in Figure 1. Under LSFO conditions, variations in fuel composition and operational environments further amplify these uncertainties.
In summary, the application of LSFO has complicated lubrication feed rate strategies, making it necessary to draw insights from existing studies. However, most related research has focused on fuel properties and tribological mechanisms, with a noticeable gap in addressing intelligent lubrication decision-making. For instance, Cihan et al. [15] combined test-bench experiments and numerical simulations to evaluate the combustion and emission characteristics of biodiesel blends, observing reduced CO and soot but increased NOx and fuel consumption; however, their work was limited to test-bench engines and specific biodiesel formulations, with limited applicability to low-speed two-stroke LSFO scenarios. Geng et al. [16] investigated ternary diesel/ethanol/n-butanol blends in marine engines and introduced a data-driven diagnostic framework, revealing the high sensitivity of combustion and emissions to fuel ratio and injection parameters, but without direct linkage to cylinder surface conditions or lubrication settings. Meanwhile, the development of green shipping has accelerated the adoption of alternative fuels. Sagin et al. [17] emphasized the impact of various alternative fuels on the emissions and performance of marine diesel engines, yet such fuel diversity further increases the uncertainty of lubrication parameter settings. On the other hand, physics-based modeling approaches have also been applied in engine condition monitoring. Fu et al. [18] proposed a physics-driven framework for online condition monitoring of marine engine systems, aiming to support health management. However, such models often rely on simplified assumptions and lack cylinder-liner surface image evidence, thereby limiting their applicability to lubrication decision-making. In contrast, Jiao et al. [19] developed a PR–CL lubrication model based on modified radial tension, quantifying its effects on oil film distribution and blow-by, yet relying on idealized boundary assumptions and limited validation data. Lyu et al. [20,21] analyzed tribofilm formation and lubrication failure mechanisms under temperature and thermo-mechanical effects, proposing a coupled thermal–frictional framework, but with limited connection to LSFO lubrication strategies. From an engineering perspective, Kamiński [9] validated a feed-rate prediction and optimization method under real operational conditions, demonstrating practical applicability; however, his work did not account for deposit or scuffing risk prediction and lacked image-based validation under diverse sea states and complex disturbances, limiting its external generalizability. Overall, although valuable progress has been made in fuel, tribology, and engineering domains, studies focusing on image-based interpretable retrieval and feed-rate decision support remain scarce. Meanwhile, the cyclical nature of shipping operations and high crew turnover [13] hinder knowledge retention, further exacerbating uncertainties in lubrication management.
Against this backdrop, developing a scientific, standardized, and intelligent lubrication decision system based on objective data has become an urgent requirement in ship operation and engine maintenance. Recent advances in deep learning and computer vision provide promising solutions. As a non-contact, low-cost, and efficient approach, computer vision enables high-resolution image acquisition and intelligent feature analysis for surface condition recognition [5], and has been widely applied in industrial inspection and fault diagnosis. Foundational research has also laid methodological and engineering groundwork: Liu et al. [22] systematically reviewed computer vision and deep learning workflows, providing reproducible practices from preprocessing to deployment for lightweight modeling and edge applications; Kim et al. [23] developed a deep hybrid diagnostic model for ship engines, demonstrating the applicability of deep learning in maritime machinery health monitoring; and Yang et al. [24] integrated visual and sensor data via hybrid deep learning, showing that multimodal fusion enhances detection accuracy and robustness, directly inspiring extensions to cylinder image retrieval frameworks.
In practice, traditional visual inspection of deposits through scavenging ports is subjective and inefficient. By leveraging computer vision, high-resolution liner surface images can be captured and processed with deep learning models to extract discriminative features and link them to historical operating conditions, enabling intelligent matching between images and lubrication feed rates. In particular, retrieval-based approaches allow the system to automatically identify similar cases from historical databases and recommend corresponding lubrication parameters, thereby improving the scientific rigor and stability of lubrication management while reducing reliance on manual expertise and supporting interpretable decision-making for intelligent ship maintenance.
Nevertheless, image-driven research in marine cylinder lubrication remains at an exploratory stage. Zhang Guichen et al. [25] applied Hu invariant moments to extract features from cylinder images, performing normalization by rotation, scaling, and translation to analyze lubrication states. However, this method is highly sensitive to illumination variations and local blurring, often causing feature distortion and reduced retrieval accuracy. In real maritime environments, cylinder images are subject to diverse interferences such as angle deviations, illumination changes, and contaminant effects, leading to high complexity and variability. These challenges expose the limitations of shallow, hand-crafted feature methods [26], highlighting the urgent need for lightweight deep learning models with enhanced robustness and deployability to support intelligent lubrication management in practice.
To address these limitations, this study focuses on typical marine engine cylinders and constructs a high-quality RGB image dataset encompassing diverse wear patterns and lubrication states under various operating conditions. Although modules such as EfficientNet, Vision Transformers (ViT), RFB, and CBAM have been extensively studied in general vision tasks, their direct application to marine cylinder lubrication image retrieval remains unexplored. The complex maritime environment, with illumination disturbances, oil contamination, and constrained onboard computational resources, introduces unique challenges that have not yet been sufficiently resolved. To address this gap, this study proposes a novel lightweight deep learning architecture, integrating CNNs and ViT to jointly capture local texture details and global structural information. By improving and tailoring these architectures for the marine lubrication scenario, the proposed model aims to maintain retrieval accuracy while reducing computational complexity and enhancing deployment efficiency. This research not only contributes to the intelligent operation and maintenance of marine diesel engine lubrication systems but also provides technological support for achieving emission reduction goals in the maritime industry, highlighting its engineering significance and research value.
The main contributions of this study are as follows:
- (1) A high-quality and representative image dataset of marine engine cylinder lubrication states was constructed using operational data from WINNING’s large bulk carriers under various sea and engine conditions.
- (2) A lightweight image retrieval model for marine engine cylinders was proposed, with the following innovations:
  - (A) A novel feature extraction network was designed, integrating EfficientNetB0 as the backbone with MobileViT modules. This hybrid structure enhances the model’s sensitivity to complex backgrounds and subtle wear patterns, improving retrieval robustness and accuracy across variable conditions.
  - (B) To further optimize performance, the model incorporates a Receptive Field Block (RFB) to expand the receptive field and improve multi-scale feature modeling, along with the CBAM to strengthen attention to key regions.
- (3) The developed computer vision-based lubrication system was applied on the WINNING UNIVERSE, effectively addressing issues such as insufficient engineering expertise and uncertainty in cylinder oil adjustment.
2. Materials and Methods
2.1. Study Subject
This study focuses on the cylinder systems of marine main engines. The data were collected from two large bulk carriers operated by WINNING (Singapore)—WINNING UNIVERSE and SUNNY KANKAN. Each vessel has a deadweight of approximately 200,000 tons and operates on a fixed route between Yantai Port, China, and Boké Port, Guinea, characterized by complex and variable sea conditions, making them highly representative. Both vessels are equipped with MAN B&W 6S60MC low-speed, two-stroke diesel engines featuring six-cylinder configurations, which are widely used in large bulk carriers and are thus considered representative models for this study.
2.2. Data Acquisition and Preprocessing
2.2.1. Cylinder Image Acquisition from Marine Main Engines
This study aims to develop an image retrieval model for marine engine cylinders, utilizing an image-based matching mechanism to optimize current lubrication strategies based on historical cylinder conditions and corresponding oil feed rate data. The goal is to reduce the risk of failures such as carbon buildup and scuffing, thereby ensuring the safe and stable operation of main engines.
For image acquisition, a digital camera was employed to capture fine details of wear and carbon deposits on the inner cylinder wall. The process involved manual operation during engine shutdown: crew members entered the engine room and photographed the inner walls of each cylinder through the scavenging ports from multiple angles. This ensured comprehensive visual coverage of critical features such as deposit formation and surface wear, providing high-quality data for model training and lubrication optimization. The data collection workflow is illustrated in Figure 2.
To enhance representativeness and applicability, image acquisition covered typical operating scenarios such as engine shutdowns and port stays, under various sea conditions. Since 13 February 2020, a total of 2838 cylinder images have been collected, forming a dataset with broad temporal and operational diversity.
2.2.2. Preprocessing of Cylinder Images from Marine Main Engines
Given the absence of publicly available datasets specifically for marine engine cylinder image retrieval, and the impracticality of capturing scavenging port images during normal ship operation, the image samples in this study are primarily sourced from the bulk carriers WINNING UNIVERSE and SUNNY KANKAN. Due to the limited number of raw images, data augmentation was applied to enhance structural diversity, improve model generalization, and mitigate overfitting. Augmentation techniques included image rotation, random cropping, brightness adjustment, sharpness variation, and random occlusion, simulating variations caused by lighting, equipment, and operator differences. Examples of augmented images are shown in Figure 3.
Following augmentation, the dataset was expanded to 31,218 images. To meet memory constraints during deep learning training, all images were resized to 224 × 224 pixels prior to input. This preprocessing preserved essential cylinder features while reducing computational complexity, enabling efficient model training under limited hardware conditions.
2.3. Overall Technical Workflow
This study focuses on optimizing lightweight CNNs by integrating multiple performance-enhancing modules. A high-accuracy, lightweight image retrieval model for marine engine cylinders is proposed, capable of returning similar historical images along with corresponding oil feed rate data. This output provides practical guidance for engine crew in assessing current cylinder conditions, optimizing lubrication strategies, and planning maintenance. The overall workflow is shown in Figure 4.
In the model construction phase, the design prioritizes both accuracy and computational efficiency. A lightweight CNN serves as the backbone, enhanced with ViT and attention mechanisms to improve retrieval precision while maintaining a compact architecture. During the training phase, the augmented dataset is split into training and validation sets in an 8:2 ratio, containing 24,974 and 6244 images, respectively. Transfer learning is employed to facilitate efficient model training. In the validation phase, the performance of the optimized lightweight model is evaluated using a test set, and its effectiveness in real-world image retrieval tasks is demonstrated.
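The 8:2 split described above can be checked with a short sketch. It only reproduces the stated counts (24,974 and 6244 out of 31,218 images); the shuffling, the seed, and the function name `split_indices` are illustrative assumptions, since the text does not say how samples were assigned to each set.

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Shuffle sample indices and split them into training and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: random assignment with a fixed seed
    n_train = int(n_samples * train_fraction)  # floor; the remainder goes to validation
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(31218)
print(len(train_idx), len(val_idx))  # 24974 6244
```

When augmented copies of one photograph can land in both sets, validation scores tend to be optimistic, so a split grouped by source image may be worth considering in a reimplementation.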
2.4. Design of a Lightweight Image Retrieval Network
This study proposes a high-accuracy, lightweight image retrieval model tailored for marine engine cylinders. The overall network architecture is illustrated in Figure 5. The model comprises three key components: a backbone network, a feature enhancement module, and a similar image retrieval module. Considering deployment in resource-constrained environments, EfficientNetB0 is selected as the backbone to extract foundational semantic features [27].
To enhance the model’s multi-scale perception and improve local structure representation, a Receptive Field Block (RFB) is introduced following the backbone. RFB employs parallel convolutional paths with varying receptive field sizes to effectively capture features at different spatial scales, improving the network’s ability to represent fine structural details such as scratches and carbon deposit edges [28].
To address the limitations of convolutional networks in modeling global dependencies, a MobileViT module is incorporated to perform global feature modeling. By leveraging the capabilities of Vision Transformers, this module captures long-range dependencies and complements the CNN’s local focus, enhancing the completeness and expressiveness of feature representations [29].
To further improve the model’s discriminative capacity for key regions, a CBAM is integrated [30]. CBAM combines channel and spatial attention mechanisms to adaptively highlight critical regions and suppress background noise and redundant information, enhancing sensitivity to wear and deposit patterns on cylinder surfaces.
In the output stage, Global Average Pooling is applied to reduce feature dimensionality and improve generalization, while an L2 Normalization layer standardizes embedding vectors as unit-length outputs. This ensures consistent and stable similarity computation during retrieval. The coordinated design of these modules significantly enhances the model’s performance in marine cylinder image retrieval tasks.
2.4.1. Backbone Construction Based on EfficientNetB0
The backbone network in this study integrates EfficientNetB0 with RFB and MobileViT modules, achieving a balance between computational efficiency and feature representation. By combining receptive field expansion and global modeling, the network strengthens its capability to capture image semantics. EfficientNetB0, based on Mobile Inverted Bottleneck Convolution (MBConv) and a compound scaling strategy, adjusts depth, width, and resolution jointly to optimize the trade-off between compactness and performance. Its overall architecture is shown in Figure 6. The MBConv module expands channels through a 1 × 1 pointwise convolution, followed by batch normalization and the Swish activation function, as defined in Equation (1):
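$$\mathrm{Swish}(x) = x \cdot \sigma(x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}} \tag{1}$$
In this activation, $x$ denotes the input to the nonlinear unit (the output of the preceding layer). The function $\sigma(\cdot)$ is the logistic sigmoid, $\sigma(x) = 1/(1 + e^{-x})$, which maps each input to (0, 1). The Swish function is then defined as $\mathrm{Swish}(x) = x \cdot \sigma(x)$, applied element-wise for vectors or tensors.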
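As a minimal sketch of the Swish activation described above, the following pure-Python functions evaluate it on scalars and flat lists. The function names are illustrative, and the branch inside `sigmoid` is only a numerical-stability detail that the text does not discuss.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swish(x):
    """Swish activation: the input scaled by its own sigmoid gate."""
    return x * sigmoid(x)


def swish_elementwise(values):
    """Apply Swish independently to every element of a flat list."""
    return [swish(v) for v in values]


print(swish_elementwise([-2.0, 0.0, 2.0]))
```

Unlike ReLU, Swish is smooth and lets small negative inputs pass through slightly attenuated instead of zeroing them.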
This is followed by depthwise separable convolution for spatial feature extraction and another 1 × 1 convolution for channel compression. Finally, a Squeeze-and-Excitation (SE) attention mechanism is integrated to model and reweight inter-channel importance, thereby enhancing the model’s responsiveness to critical features. The corresponding operations are shown in Equations (2)–(5):
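$$U = W * X \tag{2}$$
$$z = F_{sq}(U) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U(i, j) \tag{3}$$
$$s = \sigma\big(W_2\, \delta(W_1 z)\big) \tag{4}$$
$$\tilde{X} = s \cdot U \tag{5}$$
In these equations, which follow the standard Squeeze-and-Excitation formulation, $U$ represents the transformed feature representation, obtained by weighting the input features $X$ with the weight matrix $W$. $z$ is the compressed result obtained by applying the Squeeze operation $F_{sq}$ to the feature $U$ (an average over its $H \times W$ spatial positions, per channel), representing the importance of the channel features. Next, $s$ is the per-channel scalar weight calculated using the sigmoid activation function $\sigma$ after two fully connected layers $W_1$ and $W_2$ with an intermediate nonlinearity $\delta$; it adjusts the weight of the features. Finally, the weighted feature $\tilde{X}$ is generated by scaling the feature $U$ with the channel weight $s$.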
MBConv applies depthwise convolution with stride 1 and residual connections when maintaining resolution, and uses stride 2 with increased channels for downsampling to enhance high-level feature representation.
To enlarge the receptive field and improve anomaly detection, an RFB module is integrated into the backbone. Using a multi-branch design with 1 × 1, 3 × 3, and 5 × 5 kernels, each followed by dilated convolutions with varying rates, RFB simulates multi-scale receptive fields while maintaining low complexity. The concatenated outputs form a rich representation that complements EfficientNetB0 in capturing complex wear patterns and detecting subtle anomalies in cylinder lubrication images, as illustrated in Figure 7.
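To make the multi-scale effect concrete, the sketch below computes the receptive field of each RFB branch. The text does not give the dilation rates, so the 3 × 3 dilated kernels with rates 1, 3, and 5 used here are assumptions borrowed from the commonly cited RFB design; stride 1 is also assumed throughout.

```python
def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def branch_receptive_field(first_kernel, dilated_kernel, dilation):
    """Receptive field of a plain convolution followed by a dilated one (stride 1)."""
    return first_kernel + effective_kernel(dilated_kernel, dilation) - 1


# (first kernel, dilated kernel, dilation rate); the rates are assumed, not from the text
branches = [(1, 3, 1), (3, 3, 3), (5, 3, 5)]

for first, dilated, rate in branches:
    rf = branch_receptive_field(first, dilated, rate)
    print(f"{first}x{first} conv + {dilated}x{dilated} conv (rate {rate}): {rf}x{rf}")
```

Under these assumptions the three branches see 3 × 3, 9 × 9, and 15 × 15 neighbourhoods, which is how the block covers fine scratches and broader deposit regions at the same time.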
To overcome the limitations of convolutional structures in modeling long-range dependencies, this study integrates the MobileViT module into the backbone network. By embedding self-attention mechanisms within local windows, the module enables global context modeling while preserving the local receptive capabilities of convolution, thereby enhancing structural representation completeness and robustness under complex operating conditions.
The MobileViT module consists of three components: a local feature encoding module, a global feature encoding module, and a feature fusion module. Its overall architecture is shown in Figure 8. First, the local encoding module receives an input tensor $X \in \mathbb{R}^{H \times W \times C}$ and applies a standard $n \times n$ convolution to extract local spatial features. This is followed by a 1 × 1 pointwise convolution to project the features into a higher-dimensional space, yielding an intermediate representation $X_L$ of size $H \times W \times d$.
Next, the tensor is partitioned into N non-overlapping flattened patches and reshaped into $X_U \in \mathbb{R}^{P \times N \times d}$, where $P = h \times w$ represents the number of pixels per patch. These patches are then passed through a Transformer encoder in the global encoding module to model long-range dependencies between them, producing a global feature tensor of the same dimensionality, $X_G \in \mathbb{R}^{P \times N \times d}$. This process effectively maps local features to global semantics. The encoding procedure is shown in Equation (6):
$$X_G(p) = \mathrm{Transformer}\big(X_U(p)\big), \quad 1 \le p \le P \tag{6}$$
Here, $X_U(p)$ is the locally encoded feature at the p-th pixel position across the N patches, Transformer(·) denotes global encoding, and $X_G(p)$ is the globally encoded output, with p ∈ [1, P] and P = h × w.
The feature fusion module then reshapes the output of the global encoding module back to the original spatial dimensions, yielding a tensor of size $H \times W \times d$. A 1 × 1 convolution is applied to project this tensor into a lower-dimensional space with $C$ channels, which is then concatenated with the original input feature $X$ to form a new tensor of size $H \times W \times 2C$. Finally, an $n \times n$ convolution is used to fuse the concatenated features, producing the final output tensor $Y \in \mathbb{R}^{H \times W \times C}$.
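The patch partitioning and its inverse can be illustrated on a single-channel grid. This is a sketch of the reshaping step only, with no convolution or Transformer; the 4 × 6 grid, the 2 × 2 patch size, and the function names `unfold` and `fold` are hypothetical choices for illustration.

```python
def unfold(grid, ph, pw):
    """Rearrange an H x W grid into P rows of N values (P = ph * pw).

    Row p collects the value at pixel position p from every patch, so a
    Transformer applied to one row attends across patches.
    """
    H, W = len(grid), len(grid[0])
    if H % ph or W % pw:
        raise ValueError("grid size must be divisible by the patch size")
    rows = []
    for py in range(ph):
        for px in range(pw):
            rows.append([grid[by * ph + py][bx * pw + px]
                         for by in range(H // ph)
                         for bx in range(W // pw)])
    return rows


def fold(rows, H, W, ph, pw):
    """Inverse of unfold: place every value back at its spatial position."""
    grid = [[None] * W for _ in range(H)]
    for p, row in enumerate(rows):
        py, px = divmod(p, pw)
        for n, value in enumerate(row):
            by, bx = divmod(n, W // pw)
            grid[by * ph + py][bx * pw + px] = value
    return grid


H, W, ph, pw = 4, 6, 2, 2  # hypothetical sizes
grid = [[r * W + c for c in range(W)] for r in range(H)]
tokens = unfold(grid, ph, pw)
print(len(tokens), len(tokens[0]))  # 4 6  (P pixel positions, N patches)
assert fold(tokens, H, W, ph, pw) == grid
```

Because folding exactly undoes unfolding, the globally encoded features return to the same spatial layout and can be concatenated with the input feature map.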
Through this three-stage structure, the MobileViT module effectively integrates local and global information, providing richer feature representations for cylinder lubrication image retrieval. The improved backbone network, rebuilt upon the EfficientNetB0 structure, is shown in Figure 9.
The structural parameters are shown in Table 1.
2.4.2. CBAM Attention Enhancement Mechanism
Images from scavenging ports often contain interferences such as oil stains that visually resemble target regions, increasing retrieval difficulty. To address this, the CBAM module is integrated into the backbone, where channel and spatial attention guide the network to focus on key regions (e.g., carbon deposits, piston rings, pistons) while suppressing irrelevant noise. This enables adaptive feature refinement with minimal additional complexity.
As shown in Figure 10, CBAM sequentially applies channel and spatial attention to refine input features. Channel attention generates descriptors via global average and max pooling, which are passed through a shared multilayer perceptron (MLP) to compute channel weights. The reweighted features are then processed by spatial attention, where average and max pooling along the channel dimension produce spatial maps. After concatenation, a 7 × 7 convolution and Sigmoid activation yield the spatial attention map, which is multiplied with the input features to produce the final output. This mechanism adds negligible computational cost while significantly enhancing the model’s focus on informative regions and its feature representation capacity. The attention computation is shown as follows:
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) \tag{7}$$
$$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big) \tag{8}$$
Here, $M_c(F)$ denotes the output of the channel attention module, and $M_s(F)$ denotes the output of the spatial attention module. $F$ is the input feature map. AvgPool and MaxPool represent global average pooling and global max pooling operations, respectively; MLP refers to a multilayer perceptron; $\sigma$ denotes the Sigmoid activation function; $f^{7 \times 7}$ indicates a 7 × 7 convolution operation; and $[\,\cdot\,;\,\cdot\,]$ denotes the concatenation of the pooled feature maps along the channel axis.
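The channel-attention branch described above can be sketched in pure Python as follows. Only the channel branch is shown (the 7 × 7 spatial convolution is omitted for brevity), and the tiny feature map and the MLP weights `W1` and `W2` are hypothetical values chosen for illustration, not trained parameters.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def shared_mlp(descriptor, W1, W2):
    """Two-layer perceptron with a ReLU in between; shared by both descriptors."""
    hidden = [max(0.0, h) for h in matvec(W1, descriptor)]
    return matvec(W2, hidden)


def channel_attention(feature_map, W1, W2):
    """Per-channel weights from average- and max-pooled descriptors."""
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    mx = [max(max(row) for row in ch) for ch in feature_map]
    a, m = shared_mlp(avg, W1, W2), shared_mlp(mx, W1, W2)
    return [sigmoid(x + y) for x, y in zip(a, m)]


def apply_channel_weights(feature_map, weights):
    """Scale every value of each channel by that channel's attention weight."""
    return [[[w * v for v in row] for row in ch] for ch, w in zip(feature_map, weights)]


# Hypothetical 2-channel, 2 x 2 feature map and MLP weights (reduction to 1 hidden unit)
F = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 1.0]]]
W1 = [[0.5, -0.5]]
W2 = [[1.0], [-1.0]]

weights = channel_attention(F, W1, W2)
refined = apply_channel_weights(F, weights)
print(weights)
```

In this toy case the strongly activated first channel receives a weight close to 1 and the weak second channel a weight close to 0, which is the reweighting behaviour the channel branch is meant to provide.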
2.4.3. Design of the Lightweight Retrieval Module
After feature enhancement by the CBAM module, the lightweight image retrieval module receives a high-dimensional feature map of size 7 × 7 × 1280. First, global average pooling (GAP) is applied to compress the feature map into a 1 × 1 × 1280 vector, effectively aggregating semantic information across the entire image. This operation reduces feature dimensionality while preserving global responses across channels, facilitating the construction of a unified and compact image representation for subsequent retrieval tasks.
Next, the resulting feature vector is passed through an L2 normalization layer, projecting it onto a unit hypersphere by normalizing its norm to 1. This step eliminates magnitude disparities among features and ensures that similarity comparisons focus on directional differences rather than absolute values, thereby improving the stability and comparability of distance calculations. The normalized features exhibit stronger discriminative capability, contributing to improved retrieval accuracy.
In practical applications, the same feature extraction and normalization processes are applied to both the query image and all images in the database. Image similarity is then measured using cosine similarity, and the top-ranked matches with the highest similarity scores are selected as retrieval outputs. The cosine similarity is shown in Equation (9):
$$\mathrm{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert} \tag{9}$$
In this equation, $q$ and $d$ represent the feature vectors of the query image and a database image, respectively.
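The retrieval steps above (global average pooling, L2 normalization, cosine ranking) can be sketched end to end. The three-dimensional embeddings, image identifiers, and feed-rate values below are hypothetical placeholders; the actual model produces 1280-dimensional vectors, and the database layout is an assumption.

```python
import math


def global_average_pool(feature_map):
    """Average each channel's H x W grid to one value (C channels -> C-vector)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vector]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, database, top_k=2):
    """Return the top_k (similarity, image_id, feed_rate) records, best first."""
    q = l2_normalize(query_embedding)
    scored = [(cosine_similarity(q, l2_normalize(emb)), image_id, feed_rate)
              for image_id, emb, feed_rate in database]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical records: (image id, embedding, feed rate in g/kWh)
database = [
    ("img_a", [1.0, 0.0, 0.0], 0.9),
    ("img_b", [0.0, 1.0, 0.0], 1.1),
    ("img_c", [0.6, 0.8, 0.0], 1.0),
]

# Hypothetical 3-channel, 2 x 2 feature map for the query image
query_map = [[[2.0, 2.0], [2.0, 2.0]], [[0.1, 0.1], [0.1, 0.1]], [[0.0, 0.0], [0.0, 0.0]]]
query = global_average_pool(query_map)

for score, image_id, feed_rate in retrieve(query, database):
    print(f"{image_id}: similarity={score:.3f}, feed rate={feed_rate} g/kWh")
```

Since every embedding is unit length after normalization, the cosine similarity reduces to a dot product, so database embeddings can be normalized once and stored.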
This retrieval mechanism provides marine engineering personnel with an efficient and intuitive basis for image matching, facilitating rapid assessment of cylinder lubrication conditions and identification of carbon deposits or abnormal wear. It offers reliable support for oil feed rate adjustment, maintenance decision-making, and fault prediction.
2.5. Model Training
During the experimental phase, a rigorously controlled environment was employed to ensure the reproducibility and reliability of results. Python 3.12 was used as the primary programming language in conjunction with the Windows 11 operating system, providing a stable foundation for experimentation. The deep learning framework PyTorch 2.7 served as the core platform for model training and evaluation. Additionally, computational efficiency was enhanced through the use of an NVIDIA GeForce RTX 4060 GPU and a 13th-generation Intel Core i9-13980HX CPU. The key hyperparameter settings used during model training are summarized in Table 2.
2.6. Model Evaluation
To comprehensively evaluate the model’s accuracy in image retrieval tasks, this study adopts three commonly used performance metrics, Precision, Recall, and F1-score, which are computed from the counts of True Positives (TP), False Positives (FP), and False Negatives (FN). TP refers to the number of correct target samples successfully retrieved by the model that match the query image. FP denotes the number of non-target samples incorrectly identified as matches, while FN represents the number of actual target samples the model fails to retrieve. These metrics collectively reflect the model’s correctness and error rate from different perspectives. The corresponding calculation formulas are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$
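As a small worked example of the accuracy metrics built on TP, FP, and FN, the sketch below computes Precision, Recall, and F1-score from raw counts. The counts in the example call are hypothetical and are not taken from the reported experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F1-score from retrieval counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 correct retrievals, 10 false matches, 30 missed targets
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"Precision={p:.4f}, Recall={r:.4f}, F1={f1:.4f}")
```

The zero-denominator guards are a convenience for empty result sets; how such cases were handled in the study is not stated.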
To more comprehensively assess the practical deployment potential of the image retrieval model, this study introduces three lightweight evaluation metrics—model size, GFLOPs, and FPS—as effective complements to accuracy-based indicators. From a computational resource perspective, model size reflects the parameter space occupied by the model, indicating its demand for memory and storage. A smaller model size implies lower deployment requirements, particularly suitable for resource-constrained environments such as edge devices and embedded systems.
GFLOPs (Giga Floating Point Operations) measures the total number of floating-point operations required for a single forward pass, serving as an indicator of computational complexity. Higher GFLOPs values suggest greater computational load and potentially longer inference time. FPS (Frames Per Second) quantifies the number of image frames the model can process per second, providing a key measure of real-time performance. A higher FPS indicates faster response and better suitability for real-time image retrieval tasks.
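One common way to estimate FPS is to time repeated forward passes after a short warm-up, as sketched below. The helper `measure_fps` and the stand-in `dummy_infer` function are illustrative assumptions; the text does not describe the exact timing protocol, and with a GPU model the device would also need to be synchronized before reading the clock.

```python
import time


def measure_fps(infer, inputs, warmup=2):
    """Estimate frames per second of `infer` over `inputs` after a warm-up."""
    for sample in inputs[:warmup]:
        infer(sample)  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for sample in inputs:
        infer(sample)
    elapsed = time.perf_counter() - start
    return len(inputs) / elapsed if elapsed > 0 else float("inf")


def dummy_infer(pixels):
    """Stand-in for a model forward pass: a cheap reduction over the input."""
    return sum(p * p for p in pixels)


# 50 fake "images", each flattened to 224 * 224 values
frames = [[0.5] * (224 * 224) for _ in range(50)]
fps = measure_fps(dummy_infer, frames)
print(f"{fps:.1f} FPS")
```

Measured FPS depends on hardware, batch size, and input resolution, so values are only comparable when those settings are held fixed across models.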
Together, these three metrics constitute essential dimensions for evaluating the model’s lightweight characteristics, facilitating deployment optimization and runtime efficiency without compromising retrieval accuracy.
3. Results
3.1. Comparison of Backbone Network Architectures
This study aims to develop a lightweight yet high-performance image retrieval model for cylinder lubrication, targeting practical deployment in resource-constrained environments. To this end, a systematic comparison was conducted across several mainstream convolutional neural network architectures, including AlexNet [31], VGG16 [32], GoogLeNet (Inception v1) [33], ResNet50 [34], and EfficientNetB0.
The evaluation was performed using four metrics: Top-1 Accuracy, GFLOPs, FPS, and Model Size. Among these, model size was considered the primary criterion, reflecting deployment feasibility under storage constraints. The other three metrics served as complementary indicators to assess the trade-off between accuracy and computational efficiency. The detailed results are presented in Table 3.
The results indicate that among the five evaluated CNN architectures, EfficientNet-B0 demonstrates a significant advantage in model compactness, with a parameter size of only 5.3 MB—substantially smaller than the other models. It also outperforms the alternatives in both Top-1 classification accuracy (77.1%) and GFLOPs (0.39), exhibiting a well-balanced trade-off between accuracy and computational efficiency. Although its inference speed is slightly lower than that of AlexNet, EfficientNet-B0 achieves a considerably smaller model size while maintaining high accuracy and low computational cost, making it more suitable for real-world deployment scenarios. Accordingly, this study adopts EfficientNet-B0 as the baseline model for subsequent exploration of lightweight structural design and further optimization.
3.2. Comparison of Lightweight Model Improvements
To further enhance the performance of the image retrieval model while meeting lightweight deployment requirements, this study introduces architectural improvements based on EfficientNet-B0. By integrating the RFB, MobileViT Block, and CBAM attention mechanism into the EfficientNet-B0 framework, the model achieves significant gains in feature extraction and representation capability without substantially increasing its size.
The objective of these enhancements is to improve retrieval accuracy for cylinder lubrication images while preserving a lightweight structure suitable for real-world deployment. To comprehensively evaluate the effectiveness of the proposed modifications, this study compares the original and improved models using the pre-divided dataset across six dimensions: Recall, Precision, F1-score, Model Size, GFLOPs, and FPS. The detailed performance differences and improvements are presented in Table 4.
As shown in Table 4, all improved models exhibit significantly better retrieval performance than the original EfficientNet-B0, while maintaining low computational overhead. Notably, the incorporation of the MobileViT Block increases the recall to 93.04% and the F1-score to 93.27%, representing a substantial improvement over the baseline. Meanwhile, the model size increases by only approximately 6 MB, and GFLOPs rise from 0.39 to 0.60, still within the lightweight range.
In the comparative experiments involving attention mechanisms, both ECA and CBAM modules contributed to improvements in retrieval performance. Overall, the performance advantage of CBAM over ECA was relatively limited. Both achieved identical precision scores (93.56%), with a slight increase in recall from 93.15% to 93.18% and a marginal rise in F1-score from 93.35% to 93.37%. Given the minimal differences in quantitative metrics, this study further employed heatmap visualizations to intuitively compare the two mechanisms in terms of feature extraction and spatial attention. The visualization results are shown in Figure 11.
As shown in the heatmap comparison, CBAM (left) yields more comprehensive coverage of the cylinder region, effectively capturing the entire area of interest. In contrast, the ECA heatmap (right) exhibits narrower focus, with a risk of missing critical structures, indicating weaker spatial awareness. The color scale reflects the relative attention intensity, where warmer colors (e.g., red and yellow) indicate regions with higher attention weights, while cooler colors (e.g., blue) represent lower responses. Based on this observation, CBAM was ultimately selected as the preferred attention mechanism for this study.
To further enhance retrieval accuracy, the RFB was introduced alongside the attention mechanism to expand the effective receptive field and strengthen the representation of local features. The integration of RFB significantly improved the model’s ability to capture multi-scale semantic information, thereby enhancing feature matching robustness. Experimental results confirm that the model achieved optimal performance with the addition of RFB: Recall reached 99.69%, Precision 99.71%, and F1-score 99.70%. Meanwhile, the model maintained a compact size of 29.3 MB, a computational complexity of 0.69 GFLOPs, and an inference speed of 98 FPS, striking a favorable balance between accuracy and efficiency. These results underscore its strong engineering applicability and deployment potential.
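As a quick consistency check on the figures quoted above, the sketch below recomputes the F1-score from the reported Precision and Recall. It assumes the standard definition of F1 as the harmonic mean of Precision and Recall, and it reads the ECA/CBAM sentence as "ECA first, CBAM second"; the function name `f1_score` and the dictionary layout are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as quoted in the text.
# The assignment of values to ECA and CBAM is an assumed reading.
reported = {
    "ECA": (93.56, 93.15, 93.35),
    "CBAM": (93.56, 93.18, 93.37),
    "RFB + MobileViT Block + CBAM": (99.71, 99.69, 99.70),
}

for name, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{name}: computed F1 = {f1:.2f}%, reported F1 = {f1_reported:.2f}%")
```

Under this assumption, the recomputed values agree with the reported F1-scores to two decimal places for all three configurations.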
Taking into account retrieval performance, model size, and inference speed, this study identifies the EfficientNetB0 + RFB + MobileViT Block + CBAM configuration as the optimal lightweight solution for cylinder lubrication image retrieval tasks.
3.3. Modeling Results of the Optimal Lightweight Strategy
Model convergence is also a critical indicator for evaluating reliability. Therefore, this study visualizes the training dynamics of the optimal lightweight cylinder lubrication image retrieval model (EfficientNetB0 + RFB + MobileViT Block + CBAM). The plotted curves include training loss (Train Loss), validation loss (Val Loss), Precision, Recall, and F1-score over the course of training. The corresponding results are shown in Figure 12, Figure 13, Figure 14 and Figure 15.
The results demonstrate that the loss curves of the EfficientNetB0 + RFB + MobileViT Block + CBAM model exhibit effective convergence during both training and validation, indicating the model’s reliability. Moreover, the training curves of Precision, Recall, and F1-score also show clear convergence and optimization. These findings confirm the robustness and reliability of the proposed lightweight image retrieval model for cylinder lubrication.
3.4. Retrieval Results of Cylinder Lubrication Images
This study aims to develop an image retrieval model for analyzing the lubrication state of marine engine cylinders. To ensure practical applicability and generalization, the collected dataset encompasses typical variations such as carbon buildup and wear under diverse sea conditions. To evaluate the model’s retrieval performance across different operating scenarios, a subset of image samples was randomly selected during the testing phase to conduct real-world retrieval tasks. The objective was to assess matching accuracy and robustness under multi-condition backgrounds. The detailed results are presented in Figure 16, Figure 17 and Table 5.
Figure 18. Image Retrieval Results for Lubrication Rate Adjustment: Comparison of Query and Similar Cases.
Application and validation experiments were conducted using the proposed lightweight cylinder lubrication image retrieval model on both full-cylinder images and localized carbon deposit regions. The results demonstrate that the model achieves high retrieval accuracy and exhibits strong stability and generalization across diverse input conditions. Specifically, it is capable of efficiently and accurately identifying the most similar historical image samples from the augmented dataset, thereby providing reliable decision support and a model foundation for optimizing cylinder oil feed rates.
3.5. Application Workflow of the Cylinder Lubrication Image Retrieval Model
During port stays, engine crew typically conduct routine inspections of the main engine cylinders via scavenging ports. To enable intelligent evaluation of lubrication conditions and optimization of oil feed rates, this study integrates image acquisition into the inspection process and inputs the captured scavenging port images into the proposed lightweight image retrieval system.
During retrieval, the system compares the input image with samples in the historical image database and returns the most visually similar image, along with its associated operational data (such as oil feed rate, operating time, and wear condition), as a reference for current lubrication decisions. To ensure that strategy adjustments are effective, the system must prioritize similar samples that demonstrated improved operational outcomes during subsequent inspection cycles. This avoids incorrect recommendations based on degraded historical samples and enhances the reliability and practical applicability of the model in real-world engine environments.
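The comparison step described above can be sketched as a nearest-neighbor search over image feature vectors. This is a minimal illustration, not the system's actual implementation: cosine similarity is assumed as the similarity metric, the embeddings are toy three-dimensional vectors standing in for the network's feature output, and the names `cosine_similarity`, `retrieve_top_k`, and the `hist_*` identifiers are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the k historical image ids most similar to the query, best first."""
    scored = [
        (image_id, cosine_similarity(query, embedding))
        for image_id, embedding in database.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical); real vectors would come from the trained model.
database = {
    "hist_a": [1.0, 0.0, 0.0],
    "hist_b": [0.9, 0.1, 0.0],
    "hist_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

for image_id, score in retrieve_top_k(query, database, k=2):
    print(f"{image_id}: similarity = {score:.4f}")
```

In practice each returned id would be joined with its stored operational data (oil feed rate, operating time, wear condition) before being shown to the engineer.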
3.6. Real-Ship Application
On 13 April 2025, a shutdown inspection was conducted on the main engine at Boké Port, Guinea, during which cylinder liner images were captured through the scavenge port. The images revealed significant carbon deposits on the piston crown, with the recorded lubrication rate for that voyage segment being 1.25 g/(kW·h) (see Figure 18, top left). Given the operational constraints and the limited maintenance window, determining the appropriate lubrication rate adjustment from historical data was a key decision for optimizing cylinder lubrication performance and preventing further damage.
To support this decision, the captured image was processed using the lightweight cylinder lubrication image retrieval model developed in this study. The system’s database stores metadata for each historical image, including the date, location, operational context, and lubrication rate, along with follow-up inspection images and their respective metadata. The system also retains samples with adverse outcomes (e.g., cases where carbon buildup increased or scuffing occurred after adjustments) to provide both positive and negative reference cases for decision-making.
The retrieval module returned the top-10 most similar cases to the query image, and for each case a before-and-after adjustment comparison was made. Due to space limitations, only four representative cases with high similarity scores (96.9%, 96.26%, 95.79%, and 95.22%; see Figure 18, right side and bottom) are presented. Among these cases, only Similar Image 1 demonstrated a clear reduction in carbon deposits following lubrication rate adjustment.
Specifically, Similar Image 1 was captured on 6 December 2023 at Boké Port, Guinea, and the follow-up inspection took place on 6 February 2024 in Singapore. In this case, the lubrication rate was adjusted from 1.25 g/(kW·h) to 1.15 g/(kW·h), resulting in a noticeable reduction in carbon buildup on the piston crown without any increase in scuffing (see Figure 18, top right, “Similar Image 1” before-and-after comparison). This adjustment outcome closely aligned with the current scenario, making it a valuable reference for the present lubrication rate adjustment.
In contrast, the remaining three similar images (Similar Images 2, 3, and 4) showed lubrication rate changes, but their follow-up images did not indicate a significant reduction in carbon deposits. In some instances, carbon buildup either remained unchanged or worsened. Therefore, these images were not considered valid references for this adjustment, as they did not exhibit similar improvements.
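The screening of retrieved cases described above can be sketched as a filter over case records. This is an illustrative sketch only: the `RetrievedCase` structure and `valid_references` function are hypothetical, the similarity scores are assumed to correspond to Similar Images 1 to 4 in the order they are listed, and fields the text does not report (lubrication rates and scuffing outcomes for Similar Images 2 to 4) are left as `None` rather than filled in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetrievedCase:
    label: str
    similarity_pct: float
    rate_before: Optional[float]        # g/(kW·h); None where not reported
    rate_after: Optional[float]         # g/(kW·h); None where not reported
    deposits_reduced: bool              # follow-up showed a clear carbon reduction
    scuffing_increased: Optional[bool]  # None where not reported


def valid_references(cases: List[RetrievedCase]) -> List[RetrievedCase]:
    """Keep cases whose follow-up improved, ordered by similarity (best first)."""
    improved = [
        c for c in cases
        if c.deposits_reduced and c.scuffing_increased is False
    ]
    return sorted(improved, key=lambda c: c.similarity_pct, reverse=True)


# Values taken from the text; score-to-image order is an assumption.
cases = [
    RetrievedCase("Similar Image 1", 96.9, 1.25, 1.15, True, False),
    RetrievedCase("Similar Image 2", 96.26, None, None, False, None),
    RetrievedCase("Similar Image 3", 95.79, None, None, False, None),
    RetrievedCase("Similar Image 4", 95.22, None, None, False, None),
]

for case in valid_references(cases):
    print(f"{case.label}: {case.rate_before} -> {case.rate_after} g/(kW·h)")
```

Cases that fail the filter are not discarded from the database; they remain available as negative references, consistent with the retention of adverse-outcome samples.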
Based on the positive outcome of Similar Image 1, a conservative lubrication rate reduction was adopted for the current voyage, lowering the rate from the previous range of 1.23–1.25 g/(kW·h) to 1.15 g/(kW·h). The decision followed a thorough analysis of historical cases that took into account fuel quality, load conditions, and other operational parameters, with the aim of reducing carbon buildup while avoiding the risk of inadequate lubrication.
After the adjustment, the vessel completed its return voyage and docked at Fangchenggang Port, Guangxi, on 5 June 2025, where a follow-up inspection was conducted on the same day. New images were captured through the scavenge port, and after automatic evaluation by the retrieval model and manual verification, the results showed a significant reduction in carbon deposits compared to the baseline images from April, without any increase in scuffing (see Figure 18, bottom left, “After Lubrication Rate Adjustment”). This outcome validates the effectiveness of the image retrieval-assisted lubrication rate adjustment strategy.
4. Discussion
This study aims to develop an efficient and lightweight image retrieval model for cylinder lubrication in marine engines, supporting intelligent oil feed rate optimization. In the context of frequent crew rotation and limited experience transfer, the proposed system leverages visual recognition to automatically match historical lubrication strategies, thereby mitigating fluctuations in lubrication management caused by human factors and demonstrating strong practical value.
In this research, we first reviewed the relevant literature to verify the feasibility and effectiveness of image retrieval-based lubrication adjustment. However, challenges such as lighting variation, shadow interference, and structural occlusion in scavenging port image acquisition significantly increased the complexity of image processing and model design. While higher algorithmic complexity may improve retrieval accuracy, it often leads to increased model size and slower inference—factors incompatible with real-world deployment on resource-constrained marine platforms.
To address these constraints, this study proposes a lightweight deep learning model that integrates CNNs with ViT modules for efficient image retrieval. EfficientNetB0 was selected as the backbone network after a comparative analysis of mainstream lightweight retrieval architectures. With a model size of only 21 MB and a Top-1 accuracy of 77.1%, EfficientNetB0 strikes a strong balance between accuracy and efficiency.
To enhance global structural modeling capabilities, the MobileViT module was integrated into the EfficientNetB0 backbone. While CNNs excel at local feature extraction, they are limited in capturing global contextual relationships. The addition of MobileViT improved Precision without significantly increasing model size (an approximate 6 MB increase).
To strengthen the connection between CNN and Transformer components, an RFB was introduced between EfficientNetB0 and MobileViT. The RFB, composed of multi-branch dilated convolutions, simulates features across varying receptive fields, enhancing the model’s ability to capture multi-scale structures and contextual dependencies and thus providing richer semantic input to the Transformer layers.
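To illustrate how dilated convolutions widen the receptive field without adding weights, the sketch below computes the effective kernel extent of a single dilated convolution, k_eff = k + (k - 1)(d - 1). The 3×3 kernel and the dilation rates 1, 3, and 5 are illustrative assumptions, not the branch configuration used in this study, and `effective_kernel_size` is a hypothetical helper name.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated convolution kernel along one axis."""
    if kernel_size < 1 or dilation < 1:
        raise ValueError("kernel_size and dilation must be positive integers")
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Assumed branch settings for illustration: 3x3 kernels with dilation 1, 3, 5.
for dilation in (1, 3, 5):
    extent = effective_kernel_size(3, dilation)
    print(f"3x3 kernel, dilation {dilation}: covers {extent}x{extent} pixels")
```

Each branch keeps the same nine weights per channel pair, so parallel branches with different dilation rates sample context at several scales at a small cost in parameters.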
Additionally, two lightweight attention mechanisms, ECA and CBAM, were evaluated to further improve focus on salient image regions. Their quantitative results were nearly identical, with CBAM only marginally ahead in Recall and F1-score; CBAM was ultimately selected because its heatmaps covered the cylinder region more completely, and it did not significantly increase model parameters.
In summary, this study presents a lightweight feature extraction model comprising EfficientNetB0, RFB, MobileViT Block, and CBAM. The final model achieves an effective balance between accuracy, size (29.3 MB), and inference efficiency (GFLOPs within acceptable bounds), meeting the requirements for embedded deployment. The method has also undergone preliminary validation in real-ship operating environments, demonstrating its practical applicability.
5. Conclusions
This study proposes a lightweight image retrieval model for cylinder lubrication management by introducing targeted structural modifications to the EfficientNetB0 backbone. The designed network achieves a favorable balance between recognition accuracy and deployment efficiency, exhibiting clear advantages over baseline architectures. Quantitative evaluation on real-ship operational datasets demonstrates excellent retrieval performance, with a Precision of 99.71%, Recall of 99.69%, and F1-score of 99.70%. The model size is constrained to 29.3 MB with only a 0.3 GFLOPs increase in computational complexity (from 0.39 to 0.69 GFLOPs), making it well suited for deployment in resource-constrained shipboard environments while satisfying the requirements of real-time inference. Compared with the lightweight backbone model EfficientNetB0, the proposed method delivers superior retrieval performance at the cost of a modest increase in computational overhead, a trade-off between accuracy and complexity that underscores its potential for edge deployment.
The practical value of the model is reflected in its “retrieval-driven decision support” mechanism. Under the widespread application of low-sulfur fuels, the system can retrieve case samples and corresponding empirical adjustment strategies from historical databases according to the currently acquired images. In doing so, it provides traceable and interpretable references for engineers in oil-feed regulation. By leveraging similarity metrics and Top-K nearest neighbor retrieval, the model reduces reliance on subjective experience, enhances transparency and explainability in decision-making, and contributes to improving the operational quality and cost-effectiveness of lubrication management. This, in turn, helps mitigate the risks of accelerated liner wear and increased lubricant consumption caused by excessive or insufficient lubrication.
Nevertheless, several aspects of this study remain to be improved. First, the training data are mainly derived from specific vessel types, fixed engine series, and limited operating conditions, which constrains the generalization capability and applicability of the model. Second, the shipboard validation was conducted on a single bulk carrier, serving primarily as a preliminary application trial. While these results provide an important foundation for subsequent research, they do not yet cover systematic validation across multiple vessel types and operating scenarios. In addition, significant differences in computational resources, interface protocols, and control logic across embedded platforms pose practical challenges for large-scale integration and cross-platform deployment.
Future work will focus on three directions: (1) expanding data collection and validation to include a wider variety of vessel types and operating conditions, thereby enhancing model generalization and robustness; (2) further optimizing the network architecture to reduce inference latency and energy consumption, while systematically benchmarking alternative lightweight designs across multiple performance dimensions; and (3) exploring multi-source data fusion by integrating sensor readings, engine operating parameters, and maintenance records into a unified decision framework, complemented with uncertainty estimation and anomaly detection mechanisms to improve diagnostic reliability and adaptability under complex conditions. These improvements are expected to substantially enhance diagnostic accuracy and operational robustness, extend the applicability of the model across diverse vessel types and scenarios, and provide a more reliable technical foundation for the development of intelligent ship maintenance and decision support systems.
In summary, this study validates the feasibility and application value of a lightweight image retrieval model in cylinder lubrication management. However, to achieve widespread applicability across platforms and operating conditions, continuous efforts are still required in generalization validation, resource adaptation, and multi-source data fusion. Only through such advancements can the model be better aligned with the practical demands of intelligent ship maintenance and provide more stable technical support for cylinder lubrication management in the era of low-sulfur fuels.