# Metaheuristics Optimization with Deep Learning Enabled Automated Image Captioning System

## Abstract


## 1. Introduction

## 2. Prior Image Captioning Techniques

## 3. The Proposed Model

#### 3.1. Data Pre-Processing

- Lower case conversion;
- Removal of punctuation marks to decrease complexity;
- Removal of numeric values;
- Tokenization;
- Vectorization (to turn the original strings into integer sequences where each integer represents the index of a word in a vocabulary).
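The pre-processing steps listed above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the authors' implementation: the function names (`clean_caption`, `build_vocab`, `vectorize`), the reserved `<pad>` and `<unk>` indices, and the choice to replace punctuation and digits with spaces before whitespace tokenization are all assumptions made for the example.

```python
import re
import string
from collections import Counter
from typing import Dict, Iterable, List

# Hypothetical reserved tokens; the source does not specify any.
PAD, UNK = "<pad>", "<unk>"

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_caption(caption: str) -> List[str]:
    """Lower-case, strip punctuation and digits, then tokenize on whitespace."""
    text = caption.lower()                 # lower case conversion
    text = text.translate(_PUNCT_TABLE)    # removal of punctuation marks
    text = re.sub(r"\d+", " ", text)       # removal of numeric values
    return text.split()                    # tokenization


def build_vocab(token_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer index to every word, most frequent words first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {PAD: 0, UNK: 1}
    for word in sorted(counts, key=lambda w: (-counts[w], w)):
        vocab[word] = len(vocab)
    return vocab


def vectorize(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Turn a token list into the integer sequence of vocabulary indices."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


if __name__ == "__main__":
    captions = [
        "A man climbs a rocky wall.",
        "Three people on 2 dirt bikes!",
    ]
    tokenized = [clean_caption(c) for c in captions]
    vocab = build_vocab(tokenized)
    for toks in tokenized:
        print(toks, vectorize(toks, vocab))
```

Words that are absent from the vocabulary map to the `<unk>` index, which keeps vectorization total on unseen captions; a real pipeline would typically also pad or truncate the sequences to a fixed length.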

#### 3.2. Feature Extraction: HybridNet Model

#### 3.3. Hyperparameter Optimization

**Algorithm 1.** Pseudocode of SSA

1. Input: maximum iterations $L$, population size $m$, upper bound $ub$, lower bound $lb$, iteration counter $l = 1$
2. Initialize the salp positions $\left\{u_{1}, u_{2}, u_{3}, \dots, u_{m}\right\}$
3. While (stopping criterion is not fulfilled)
4. Determine the fitness of all salps
5. Arrange the salp positions based on fitness value
6. Define $F$ as the optimal position for the present population
7. Update $C_{l}$
8. For every salp position ($u_{i}$)
9. If $\left(i \le m/2\right)$, update the position of the leading salp
10. Else, update the position of the follower salp
11. End if
12. End for
13. Adjust any salp that crosses the upper or lower limits
14. End while
15. Display the optimum output
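The pseudocode above can be turned into a small runnable sketch. The version below follows the step order of Algorithm 1 but fills in details the section does not state, so treat them as assumptions: the coefficient updated in step 7 is taken to be the standard SSA coefficient $c_1 = 2e^{-(4l/L)^2}$, the leader and follower update rules are the ones from the commonly cited SSA formulation, the stopping criterion is simply the iteration budget $L$, and $F$ is kept as the best position found so far. The function name `salp_swarm` and the sphere objective are hypothetical; in a hyperparameter-tuning setting the `fitness` callable would instead return something like a validation loss.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def salp_swarm(
    fitness: Callable[[Sequence[float]], float],
    lb: Sequence[float],
    ub: Sequence[float],
    m: int = 30,
    L: int = 100,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Minimize `fitness` inside the box [lb, ub] with a basic salp swarm."""
    rng = random.Random(seed)
    dim = len(lb)
    # Step 2: random initial positions inside the bounds.
    salps = [[rng.uniform(lb[j], ub[j]) for j in range(dim)] for _ in range(m)]
    best_pos: List[float] = list(salps[0])
    best_fit = math.inf

    for l in range(1, L + 1):  # Step 3: stop after L iterations (assumption)
        # Steps 4-5: evaluate every salp and sort by fitness (best first).
        scored = sorted(((fitness(s), s) for s in salps), key=lambda t: t[0])
        salps = [s for _, s in scored]
        # Step 6: keep F as the best position seen so far.
        if scored[0][0] < best_fit:
            best_fit, best_pos = scored[0][0], list(salps[0])
        # Step 7: exploration coefficient from the standard SSA formulation.
        c1 = 2.0 * math.exp(-((4.0 * l / L) ** 2))
        # Steps 8-12: leaders move around F, followers trail their predecessor.
        for i in range(m):
            for j in range(dim):
                if i < m // 2:
                    c2, c3 = rng.random(), rng.random()
                    step = c1 * ((ub[j] - lb[j]) * c2 + lb[j])
                    salps[i][j] = best_pos[j] + step if c3 >= 0.5 else best_pos[j] - step
                else:
                    salps[i][j] = 0.5 * (salps[i][j] + salps[i - 1][j])
        # Step 13: pull any salp that left the search box back onto its boundary.
        for s in salps:
            for j in range(dim):
                s[j] = min(max(s[j], lb[j]), ub[j])

    # Evaluate the final positions once more before reporting the result.
    for s in salps:
        f = fitness(s)
        if f < best_fit:
            best_fit, best_pos = f, list(s)
    return best_pos, best_fit  # Step 15


def sphere(x: Sequence[float]) -> float:
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    pos, fit = salp_swarm(sphere, lb=[-5.0, -5.0], ub=[5.0, 5.0], m=30, L=200, seed=1)
    print("best position:", pos)
    print("best fitness:", fit)
```

Because $c_1$ shrinks as $l$ approaches $L$, the leaders take large exploratory steps early on and progressively smaller steps around $F$ later, which is the usual exploration-to-exploitation schedule in this family of algorithms.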

#### 3.4. Image Captioning

## 4. Performance Validation

#### 4.1. Performance Measures

#### 4.2. Result Analysis

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

**Figure 3.** Result analysis of the MODLE-AICT algorithm on the Flickr8K dataset: (**a**) BLEU-1, (**b**) BLEU-2, (**c**) BLEU-3, and (**d**) BLEU-4.

**Figure 4.** Result analysis of the MODLE-AICT algorithm on the Flickr8K dataset: (**a**) METEOR, (**b**) CIDEr, and (**c**) Rouge-L.

**Figure 8.** Result analysis of the MODLE-AICT algorithm on the MS-COCO 2014 dataset: (**a**) BLEU-1, (**b**) BLEU-2, (**c**) BLEU-3, and (**d**) BLEU-4.

**Figure 9.** Result analysis of the MODLE-AICT algorithm on the MS-COCO 2014 dataset: (**a**) METEOR, (**b**) CIDEr, and (**c**) Rouge-L.

Sample Image | Different Captions |
---|---|
Image 1 (not reproduced) | A crowd watching air balloons at night |
Image 1 | A group of hot air balloons lit up at night |
Image 1 | People are watching hot air balloons in the park |
Image 1 | People watching hot air balloons |
Image 1 | Seven large balloons are lined up at night-time near a crowd |
Image 2 (not reproduced) | A man climbs a rocky wall |
Image 2 | A climber wearing a blue helmet and headlamp is attached to a rope on the rock face |
Image 2 | A rock climber climbs a large rock |
Image 2 | A woman in purple snakeskin pants climbs a rock |
Image 2 | Person with blue helmet and purple pants is rock climbing |
Image 3 (not reproduced) | People on ATVs and dirt bikes are traveling along a worn path in a field surrounded by trees |
Image 3 | Three people are riding around on ATVs and motorcycles |
Image 3 | Three people on motorbikes follow a trail through dry grass |
Image 3 | Three people on two dirt bikes and one four-wheeler are riding through brown grass |
Image 3 | Three people ride off-road bikes through a field surrounded by trees |

Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
---|---|---|---|---|
M-RNN Model [24] | 59.18 | 29.09 | 24.17 | 14.19 |
G-NICG Model [26] | 64.13 | 42.64 | 27.11 | 16.12 |
L-Bilinear Model [27] | 65.96 | 43.29 | 28.63 | 18.49 |
DVS Model [28] | 58.35 | 37.98 | 25.54 | 17.08 |
ResNet50 Model [23] | 62.65 | 46.28 | 37.26 | 26.16 |
VGA-16 Model [23] | 67.69 | 44.34 | 33.99 | 23.25 |
HPTDL Model [25] | 68.26 | 46.16 | 37.81 | 26.71 |
MODLE-AICT | 69.06 | 47.26 | 38.78 | 27.80 |

Methods | METEOR | CIDEr | Rouge-L |
---|---|---|---|
SCST-IN Model [29] | 20.00 | 161.00 | 49.00 |
SCST-ALL Model [29] | 23.00 | 154.00 | 42.00 |
G-NIC Model [26] | 19.00 | 153.00 | 43.00 |
A-NIC Model [26] | 21.00 | 160.00 | 48.00 |
DenseNet Model [24] | 25.00 | 173.00 | 43.00 |
HPTDL Model [25] | 28.00 | 175.00 | 46.00 |
MODLE-AICT | 30.00 | 179.00 | 53.00 |

Methods | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
---|---|---|---|---|
KNN Model [25] | 49.60 | 28.60 | 17.00 | 10.95 |
G-NICG Model [26] | 67.92 | 46.23 | 34.03 | 24.94 |
L-Bilinear Model [27] | 71.75 | 49.22 | 34.65 | 24.35 |
DVS Model [28] | 63.86 | 44.98 | 32.58 | 23.29 |
ResNet50 Model [23] | 73.57 | 57.21 | 42.05 | 32.52 |
VGA16 Model | 70.30 | 53.72 | 40.45 | 30.06 |
VGA-16 Model [23] | 74.28 | 59.39 | 43.33 | 33.96 |
HPTDL Model [25] | 75.12 | 60.21 | 44.22 | 34.75 |

