Towards Fault-Aware Image Captioning: A Review on Integrating Facial Expression Recognition (FER) and Object Detection
Abstract
1. Introduction
2. Background and Literature Review on Image Captioning for Fault-Aware PHM
2.1. Development of Image Captioning
2.1.1. Convolutional Neural Networks (CNNs)
- Non-Attentive CNN Features: The introduction of convolutional neural networks (CNNs) substantially improved the performance of models that process visual inputs, including those used for image captioning. A straightforward approach extracts high-level image features from one of the final layers of a CNN and uses them as conditioning signals for the language model (Figure 4); a minimal sketch of this pipeline is given after this list. The pioneering “Show and Tell” work [61] was the first to employ this strategy, using the output of GoogleNet [43] as the language model’s initial hidden state. Karpathy et al. [62] subsequently used global features obtained from AlexNet [40] in a similar way, and Mao et al. [63] and Donahue et al. [64] injected global features from the VGG network [42] into the language model at each time step. Many later image captioning architectures adopted global CNN features as a key component [65,66,67,68,69,70,71,72]. For example, the FC model proposed by Rennie et al. [73] encodes images using ResNet101 [44] while maintaining their original spatial dimensions. Other approaches [74,75] refined the procedure by using high-level semantic features or tags, represented as probability distributions over the terms occurring most frequently in the training captions. The primary benefit of global CNN features is their compactness and simplicity, which allow for efficient information extraction and a contextual representation of the full image. Their drawbacks, namely the heavy compression of visual content and the resulting lack of granularity, can hamper the generation of accurate, fine-grained descriptions. In fault-aware image captioning for Industry 4.0, these non-attentive CNN features offer a compact baseline for rapid fault detection in manufacturing environments, such as identifying initial anomalies in machinery. However, their limited granularity highlights the need to integrate FER and object detection so that the generated captions incorporate operator expressions and support PHM through enhanced system diagnostics.
- Attention Over Grid: In response to the shortcomings of global representations, several methods have made visual encoding more granular [73,76,77]. Dai et al. [78] used 2D activation maps instead of 1D global feature vectors in order to incorporate spatial structure directly into the language model. Much of the image captioning community, however, took inspiration from machine translation and adopted the mechanism based on additive attention (Figure 5), which allows captioning systems to encode visual features that vary over time and thus produce finer-grained descriptions. Additive attention can be understood as a form of weighted averaging; it was first introduced in a sequence alignment model by Bahdanau et al. [79], where a one-hidden-layer feed-forward network determines the attention alignment score:

$f_{att}(\mathbf{x}_i, \mathbf{h}_t) = \mathbf{w}^{\top} \tanh\left(W_1 \mathbf{x}_i + W_2 \mathbf{h}_t\right)$ (1)

where $\mathbf{x}_i$ is a visual feature, $\mathbf{h}_t$ is the hidden state of the language model, and $\mathbf{w}$, $W_1$, and $W_2$ are learnable parameters. Xu et al. [76] presented an innovative method that applies additive attention over the spatial grid of a convolutional layer. Using this technique, the model can focus on specific areas of the grid by selecting appropriate feature subsets for each word in the output sequence. Activations are first obtained from the last convolutional layer of a VGG architecture [42]; additive attention then assigns weights to individual grid locations, representing their relevance for predicting the next word (see the sketch after this list).
- Attention Over Visual Regions: Neuroscience indicates that the brain combines top–down cognitive processes with bottom–up visual signals to account for saliency. The top–down pathway uses prior knowledge and inductive bias to predict sensory inputs, whereas the bottom–up pathway adjusts these predictions according to actual visual stimuli. Top–down additive attention follows this principle: the language model forecasts the next word by attending over a feature grid with image-independent geometry, effectively integrating signals from both directions. In contrast to conventional saliency-based approaches [81], Anderson et al. [80] present a bottom–up mechanism implemented by an object detector that proposes visual regions, and a top–down module is then employed to weight these regions for word prediction (Figure 6). Faster R-CNN [11] is used for object detection, producing a pooled feature vector for each region proposal. The pre-training strategy employs an auxiliary loss that predicts object and attribute classes on the Visual Genome dataset [82], allowing the model to capture a comprehensive set of detections, spanning salient objects and contextual regions, while learning robust feature representations. Image-region features have since become a fundamental component of image captioning owing to their efficacy in processing raw visual input, and numerous later studies have adopted this form of visual encoding [83,84,85,86]. Two variations are particularly noteworthy. Zha et al. [87] propose a sub-policy network that interprets visual components sequentially by encoding historical visual actions, such as previously attended regions, through an LSTM, which then provides context for the next attention decision; in conventional visual attention, only a single image region is typically attended to at each step. In fault-aware image captioning for Industry 4.0, attention over visual regions bolsters PHM by combining bottom–up object detection with top–down weighting, enabling the precise localization of faults in complex manufacturing scenes. Combined with FER, this approach yields enriched captions that capture operator expressions alongside detected anomalies, supporting real-time diagnostics and predictive maintenance (a simplified region-feature sketch follows this list). Industrial relevance: in manufacturing environments, CNN-based visual encoding can capture subtle anomalies in operator facial expressions that correlate with equipment malfunctions; for instance, concentrated frowning patterns detected via ResNet features have reportedly preceded quality-control issues by 3–5 min, enabling early warning.
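To make the encoding strategies above concrete, the following is a minimal PyTorch sketch (an illustration for this review, not code from the cited works) that extracts both a non-attentive global feature vector and a 7 × 7 grid of spatial features from a pre-trained ResNet-101, and applies Bahdanau-style additive attention (Equation (1)) over the grid given a language-model hidden state. The hidden size, tensor shapes, and use of a recent torchvision weights API are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Pre-trained CNN backbone; the final pooling and classification layers are dropped
# so the last convolutional feature map (2048 x 7 x 7 for a 224 x 224 input) is kept.
resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:-2]).eval()

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention over a grid of visual features (Equation (1))."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)      # W_1 in Equation (1)
        self.w_hidden = nn.Linear(hidden_dim, attn_dim)  # W_2 in Equation (1)
        self.score = nn.Linear(attn_dim, 1)              # w^T in Equation (1)

    def forward(self, grid_feats, hidden):
        # grid_feats: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(
            self.w_feat(grid_feats) + self.w_hidden(hidden).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)       # relevance of each grid cell
        context = (weights * grid_feats).sum(dim=1)  # attended visual context
        return context, weights.squeeze(-1)

image = torch.randn(1, 3, 224, 224)                  # placeholder pre-processed image
with torch.no_grad():
    fmap = backbone(image)                           # (1, 2048, 7, 7)

global_feat = fmap.mean(dim=(2, 3))                  # non-attentive global feature, (1, 2048)
grid_feats = fmap.flatten(2).transpose(1, 2)         # grid of 49 region vectors, (1, 49, 2048)

attention = AdditiveAttention()
h_t = torch.zeros(1, 512)                            # current language-model hidden state
context, weights = attention(grid_feats, h_t)        # context conditions the next word
```

The global vector corresponds to the non-attentive baseline, while the weighted grid cells correspond to attention over the grid; the same attention module can consume detector-proposed regions instead, as in the next sketch.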
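The bottom–up step of Anderson et al. [80] relies on a Faster R-CNN trained on Visual Genome; as a simplified, hypothetical stand-in, the sketch below pools per-region features from a CNN feature map with torchvision’s roi_align, assuming the boxes come from any off-the-shelf detector. The resulting region vectors can replace the grid cells in the attention module above.

```python
import torch
from torchvision.ops import roi_align

# Boxes in (x1, y1, x2, y2) image coordinates, e.g. produced by a Faster R-CNN;
# the values here are placeholders for illustration only.
boxes = torch.tensor([[30.0,  40.0, 180.0, 200.0],   # e.g. an operator's face
                      [10.0, 120.0, 210.0, 220.0]])  # e.g. a conveyor-belt segment

# Last convolutional feature map of the image, e.g. `fmap` from the previous sketch.
fmap = torch.randn(1, 2048, 7, 7)

# With a 224 x 224 input the map is downscaled by a factor of 32, hence spatial_scale.
region_maps = roi_align(fmap, [boxes], output_size=(7, 7), spatial_scale=1.0 / 32)
region_feats = region_maps.mean(dim=(2, 3))          # one 2048-d vector per region

# (1, num_regions, 2048): drop-in replacement for `grid_feats` in AdditiveAttention.
region_feats = region_feats.unsqueeze(0)
```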
2.1.2. Transformers
- Self-Attention Encoding: Self-attention is an attentive mechanism in which every element of a set is related to every other element; combined with residual connections, it produces a refined representation of the same set (Figure 8). The Transformer architecture and its derivatives, which have come to dominate natural language processing (NLP) and computer vision (CV), were initially introduced by Vaswani et al. [88] for machine translation and language understanding tasks. In essence, a self-attention layer enriches each element of a sequence by aggregating information from the entire input sequence. Let $X \in \mathbb{R}^{n \times d}$ represent a sequence of n entities $(x_1, x_2, \dots, x_n)$, where d is the embedding dimension of each entity. The objective of self-attention is to capture the interactions among all n entities by representing each entity in terms of the overall context. To this end, three learnable weight matrices are introduced: $W^Q \in \mathbb{R}^{d \times d_q}$ to transform queries, $W^K \in \mathbb{R}^{d \times d_k}$ to transform keys, and $W^V \in \mathbb{R}^{d \times d_v}$ to transform values, with $d_q = d_k$. The input sequence X is first projected onto these weight matrices, giving $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The self-attention layer then produces the output Z as shown in Equation (2):

$Z = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$ (2)

Early self-attention approaches can be summarized as follows. The model of Yang et al. [89] was among the first image captioning models to use a self-attentive module for encoding relationships between features obtained from an object detector. Later, Li et al. [90] proposed a Transformer model comprising a visual encoder for region features and a semantic encoder that leverages knowledge from an external tagger. Both encoders use self-attention and feed-forward layers, and their outputs are combined through a gating mechanism that controls the propagation of visual and semantic information. A minimal single-head sketch of Equation (2) follows.
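Below is a minimal single-head PyTorch sketch of Equation (2); the dimensions are illustrative, and the multi-head projections, residual connections, and layer normalization of a full Transformer encoder are omitted.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention as in Equation (2)."""
    def __init__(self, d=512, d_k=64, d_v=64):
        super().__init__()
        self.W_q = nn.Linear(d, d_k, bias=False)  # query projection W^Q
        self.W_k = nn.Linear(d, d_k, bias=False)  # key projection W^K
        self.W_v = nn.Linear(d, d_v, bias=False)  # value projection W^V

    def forward(self, X):
        # X: (n, d) sequence of n entities, e.g. region features from a detector
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        scores = Q @ K.transpose(0, 1) / math.sqrt(K.size(-1))  # (n, n) pairwise affinities
        A = torch.softmax(scores, dim=-1)                       # each row sums to 1
        return A @ V                                            # contextualized features, (n, d_v)

regions = torch.randn(36, 512)   # e.g. 36 region features embedded in d = 512 dimensions
Z = SelfAttention()(regions)
```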
2.1.3. Language Models
- Recurrent Neural Networks (RNN): Recurrent neural networks (RNNs) are a deep learning approach to modeling sequential data, and until attention models arrived they were the default choice for sequential inputs. A deep feed-forward model would require separate parameters for each element of the sequence and may fail to generalize to sequences of varying length. An RNN instead processes text one word at a time, passing information from the previous step to the next through a hidden state; the hidden state is passed to the decoding step to produce the finished sentence, as can be seen in Figure 9. RNNs struggle with long sequences because gradients are propagated through every step, and parameter updates become negligible when those gradients shrink too much (the vanishing-gradient problem). LSTMs later resolved these long-range dependency issues.
- Gated Recurrent Unit (GRU): Like the LSTM, the GRU models sequential data by selectively remembering or forgetting information, but its simplified architecture and smaller number of parameters make it easier to train and more computationally efficient. The fundamental distinction lies in how the memory is managed. In an LSTM, the input, output, and forget gates update a memory cell state that is kept separate from the hidden state; the GRU replaces the memory cell with a “candidate activation vector” that is updated by a reset gate and an update gate. The reset gate decides how much of the prior hidden state to forget, whereas the update gate decides how much of the candidate activation vector to incorporate. For sequential data modeling, the GRU is therefore a popular alternative to the LSTM, especially when computational resources are limited or a simpler design is preferred, as can be seen in Figure 10.
- LSTM: LSTM (Long Short-Term Memory) is a type of recurrent neural network that is designed to handle the problem of vanishing gradients in traditional RNNs. It is capable of learning long-term dependencies in data and is particularly effective for tasks such as natural language processing and time series prediction. LSTMs [94] have been used in a wide range of applications, including speech recognition, machine translation, and predictive modeling. One key feature of LSTMs is their ability to selectively remember or forget information over long periods of time, making them well-suited for tasks that require a memory of previous inputs as can be seen in Figure 11.
- Transformers: The fully attentive paradigm put forth by Vaswani et al. [88] radically altered the perspective on language generation. The Transformer model soon became the de facto standard for many language processing tasks and the foundation of later NLP milestones such as BERT [95] and GPT [96]. Because image captioning is a sequence-to-sequence problem, the Transformer architecture has been adopted for it as well. During training, a masking mechanism applied to the preceding words restricts the decoder to a unidirectional (left-to-right) generation process. Some image captioning models use the original Transformer decoder without major architectural changes [97,98,99,100], while others propose modifications to improve visual feature encoding and language generation. The encoder–decoder architecture of a typical Transformer is shown in Figure 12, and a minimal decoder sketch follows this list.
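As a deliberately simplified illustration of the recurrent and fully attentive language models above, the sketch below decodes a caption greedily with an LSTM conditioned on a global image feature and then builds the causal mask a Transformer decoder would use to restrict attention to preceding words. The vocabulary size, dimensions, and decoding loop are illustrative assumptions rather than any published configuration.

```python
import torch
import torch.nn as nn

VOCAB, EMBED, HIDDEN, FEAT = 10000, 256, 512, 2048

class LSTMCaptioner(nn.Module):
    """Greedy LSTM decoder conditioned on a global image feature."""
    def __init__(self):
        super().__init__()
        self.init_h = nn.Linear(FEAT, HIDDEN)   # image feature -> initial hidden state
        self.init_c = nn.Linear(FEAT, HIDDEN)   # image feature -> initial cell state
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.cell = nn.LSTMCell(EMBED, HIDDEN)  # an nn.GRUCell could be swapped in here
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, image_feat, bos_id=1, max_len=20):
        h, c = self.init_h(image_feat), self.init_c(image_feat)
        word = torch.full((image_feat.size(0),), bos_id, dtype=torch.long)
        caption = []
        for _ in range(max_len):
            h, c = self.cell(self.embed(word), (h, c))
            word = self.out(h).argmax(dim=-1)   # greedy choice of the next word
            caption.append(word)
        return torch.stack(caption, dim=1)      # (batch, max_len) token ids

caption_ids = LSTMCaptioner()(torch.randn(1, FEAT))

# A Transformer decoder enforces the same left-to-right order with a causal mask:
# position i may attend only to positions <= i (True marks a blocked position here).
seq_len = 5
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
```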
3. Facial Expression Recognition (FER) for Fault-Aware Image Captioning
3.1. Introduction
3.2. Importance of FER in Image Captioning
4. Object Detection for Fault-Aware Image Captioning in Industry 4.0
4.1. Introduction
- Object Detection Accuracy: Accurately detecting all relevant elements in an image is one of the main issues in image captioning. Included in this category are not just the scene’s primary elements but also any minor or obscured components that are essential to grasping the full picture.
- Contextual Understanding: Understanding the context of an image is critical to interpreting what it shows. For example, the captioning system should be able to resolve subtle, context-dependent differences in the meaning of objects.
- Relationships Between Objects: Determining the connections between image elements is intricate. Such relationships include physical placement (two objects resting on top of each other), action (a person pedaling a bicycle), and conceptual relations (emotional bonds).
- Handling Ambiguity: Multiple interpretations are possible due to the presence of unclear features or scenarios in images. The development of a system capable of producing correct captions while dealing with such difficulties is no easy task.
- Diverse Representation: Images depict objects, people, and scenes from all around the world. To be effective, the captioning system must handle a wide range of cultures, situations, and scenarios.
- Facial Expressions and Emotions: Complicating matters further for our particular study is the need to correctly decipher facial expressions and incorporate this data into the caption. The system’s ability to accurately detect emotions and convey them in captions depends on how well it fits the picture as a whole.
- Natural Language Generation: Producing correct captions that also seem natural and human-like is no easy feat. In order to generate captions that are both intelligible and suitable for their context, the system must comprehend linguistic subtleties, syntax, and style.
- Real-Time Processing: Applications such as live video analysis and assistive technology for visually impaired users rely on real-time captioning. In these cases, the system must process images and generate captions with very low latency.
- Training Data and Bias: Image captioning systems are highly sensitive to the variety and quality of their training material. Inaccurate or biased captions, especially when it comes to cultural or demographic representation, might be caused by bias in the training data.
- Computational Efficiency: The computing demands of image captioning systems can be high, particularly when incorporating object detection and facial expression analysis. When designing practical applications, it is crucial to strike a balance between accuracy and computing efficiency.
4.2. Importance of Object Detection in Image Captioning
4.2.1. Traditional Object Detection Methods
- Histogram of Oriented Gradients (HOG) [36]: HOG descriptors, initially proposed by Dalal and Triggs in 2005 for pedestrian detection, count occurrences of gradient orientations within localized regions of an image. The technique is highly efficient for object detection in computer vision, notably for detecting humans, and a Support Vector Machine (SVM) is frequently employed as the classifier on top of HOG features (a minimal HOG + SVM sketch follows this list).
- Scale-Invariant Feature Transform (SIFT) [114]: David Lowe introduced SIFT in 1999 to detect and describe local image features. It is used not only for detection but also for object recognition. Its descriptors are invariant to scale and rotation and largely robust to changes in illumination and 3D camera viewpoint.
- Speeded Up Robust Features (SURF) [115]: With its enhanced speed and efficiency, SURF proves to be a perfect fit for real-time applications, building upon the foundation laid by SIFT. First introduced in 2006 by Bay, Tuytelaars, and Van Gool, SURF utilizes integral images to efficiently perform image convolutions. This enables it to rapidly identify interest points, which are subsequently described and compared across images for the purpose of object detection and recognition.
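As a brief illustration of the classical HOG + SVM pipeline referenced in the first item above, the sketch below uses OpenCV’s built-in HOG descriptor with its pre-trained pedestrian SVM; the image path and detection parameters are placeholders.

```python
import cv2

# HOG descriptor paired with OpenCV's pre-trained linear SVM for pedestrians,
# following the Dalal-Triggs pipeline: per-cell gradient histograms -> SVM classifier.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("factory_floor.jpg")        # placeholder path
rects, weights = hog.detectMultiScale(image,
                                      winStride=(8, 8),  # sliding-window step
                                      padding=(8, 8),
                                      scale=1.05)        # image-pyramid scale factor

for (x, y, w, h) in rects:                     # draw one box per detected person
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```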
4.2.2. Deep Learning-Based Object Detection Methods
- You Only Look Once (YOLO): YOLO is a one-stage object detection algorithm that processes the entire image in a single pass, greatly improving computational efficiency. It partitions the image into a grid and predicts bounding boxes and class probabilities for each grid cell.
- Faster R-CNN: Faster R-CNN combines the strengths of R-CNN and Fast R-CNN by introducing a region proposal network (RPN) that generates object proposals, and it uses RoI pooling (inherited from Fast R-CNN) to extract fixed-size features from those proposals, further reducing computational cost.
- RetinaNet: RetinaNet, another one-stage object detection approach, uses a feature pyramid network (FPN) to extract features at several scales, improving its ability to detect objects of varied sizes.
- DETR (Detection Transformer): The Detection Transformer (DETR) is an innovative approach to object detection that has recently become popular for its end-to-end simplicity. It uses a Transformer architecture to predict object bounding boxes and class labels directly, skipping steps such as object-proposal generation and hand-crafted post-processing, and it can match or surpass conventional detectors such as RetinaNet and Faster R-CNN. Nevertheless, DETR’s reliance on global attention makes it less effective at capturing fine-grained details, so it can struggle with small objects. Table 2 summarizes the pros and cons, and a minimal detection example follows this list.
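For comparison with the classical pipeline, the sketch referenced at the end of the list runs a pre-trained Faster R-CNN from torchvision on a single image; the image path, confidence threshold, and use of a recent torchvision weights API are assumptions, and an Ultralytics YOLO or a DETR model could be substituted with a similarly small amount of code.

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import (fasterrcnn_resnet50_fpn,
                                           FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

img = read_image("assembly_line.jpg")          # placeholder path; uint8 CHW tensor
batch = [weights.transforms()(img)]            # resize/normalize as the model expects

with torch.no_grad():
    detections = model(batch)[0]               # dict with boxes, labels, and scores

for box, label, score in zip(detections["boxes"], detections["labels"], detections["scores"]):
    if score > 0.7:                            # simple confidence threshold
        name = weights.meta["categories"][int(label)]   # COCO class name
        print(f"{name}: {score:.2f} at {box.tolist()}")
```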
5. Datasets
5.1. Image Captioning Datasets
5.2. Facial Expression Recognition Datasets
5.3. Object Detection
6. Integration of Facial Expression Recognition and Object Detection in Image Captioning
6.1. Existing Approaches and Methodologies
6.2. Comparative Analysis and Performance Trade-Offs
6.3. Challenges in Integration
6.4. Importance and Applications in Fault-Aware Systems
- Enhanced Semantic Understanding: Integration allows captions like “Stressed operator near faulty conveyor belt,” combining FER (stress) with object detection (belt anomaly), improving diagnostics [200] (see the fusion sketch after this list).
- Fault Detection in PHM: In Industry 4.0, operator emotions can signal early faults; for example, surprise at machine vibrations can trigger maintenance, reducing downtime by 20–30% in simulations [197]. This is analogous to neurological monitoring via FER [203]. Models such as BLIP and CLIP enhance this in surveillance-like setups for real-time alerts [198].
- Improved Accessibility and Collaboration: Emotion-rich captions aid human–robot interfaces, e.g., detecting frustration for adaptive responses.
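A schematic, hypothetical fusion of the two streams is sketched below: the top-scoring FER label for a detected operator face and the object detections for the scene are merged into a fault-aware caption and a simple alert rule. The data classes, emotion labels, thresholds, and caption template are illustrative assumptions, not an implementation from the cited works.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Detection:
    label: str        # object class from the detector, e.g. "conveyor belt"
    score: float      # detection confidence
    anomalous: bool   # flag from a downstream fault/anomaly model

@dataclass
class FaceState:
    emotion: str      # FER output for a detected operator face, e.g. "stressed"
    score: float      # FER confidence

def fault_aware_caption(faces: List[FaceState], objects: List[Detection]) -> str:
    """Merge FER and object-detection outputs into a fault-aware caption."""
    person: Optional[FaceState] = max(faces, key=lambda f: f.score, default=None)
    fault: Optional[Detection] = next((o for o in objects if o.anomalous), None)
    if person and fault:
        return f"{person.emotion.capitalize()} operator near faulty {fault.label}"
    if fault:
        return f"Possible fault detected on {fault.label}"
    if person:
        return f"Operator appears {person.emotion}; no equipment anomaly detected"
    return "No operators or anomalies detected"

def should_alert(faces: List[FaceState], objects: List[Detection]) -> bool:
    """Raise a PHM alert when negative affect coincides with a detected anomaly."""
    negative = any(f.emotion in {"stressed", "surprised", "fearful"} and f.score > 0.6
                   for f in faces)
    return negative and any(o.anomalous and o.score > 0.5 for o in objects)

faces = [FaceState("stressed", 0.82)]
objects = [Detection("conveyor belt", 0.91, anomalous=True)]
print(fault_aware_caption(faces, objects))   # "Stressed operator near faulty conveyor belt"
print(should_alert(faces, objects))          # True
```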
7. Challenges and Future Directions
- Ethical and Privacy Concerns: Facial expression recognition technology raises numerous privacy and ethical issues, including consent and the potential for misuse. Protecting people’s privacy and ensuring ethical use are of utmost importance.
- Data Diversity and Bias: Training these integrated systems requires broad and extensive datasets to prevent bias and to ensure the models perform well across many demographics and situations.
8. Conclusions
- The transition from global CNN features to attention-based mechanisms and transformer architectures has significantly improved the granularity and accuracy of image descriptions.
- Facial expression recognition adds a crucial emotional dimension to image understanding, enabling captions that reflect the emotional states and intentions of subjects within images.
- Object detection techniques, particularly modern approaches like YOLO and DETR, provide the spatial and relational context necessary for generating precise and meaningful captions.
- The integration of these technologies is particularly valuable for fault-aware systems in industrial settings, where operator facial expressions combined with visual monitoring can provide early indicators of system anomalies or operational issues.
- Development of lightweight, efficient architectures that can perform integrated FER, object detection, and captioning in real-time on edge devices, crucial for Industry 4.0 applications.
- Creation of specialized datasets that include annotated facial expressions, object relationships, and contextual information specifically designed for industrial and PHM applications.
- Investigation of few-shot and zero-shot learning approaches to reduce dependency on large annotated datasets while maintaining performance across diverse scenarios.
- Exploration of multimodal fusion techniques that can incorporate additional sensory data (audio, thermal, and vibration) alongside visual information for more comprehensive system understanding.
- Development of explainable AI techniques that can provide insights into how emotional and object-based features influence caption generation, essential for critical applications in industrial settings.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Hinami, R.; Matsui, Y.; Satoh, S.I. Region-based image retrieval revisited. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 528–536. [Google Scholar]
- Hand, E.; Chellappa, R. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31, pp. 4068–4074. [Google Scholar]
- Cheng, X.; Lu, J.; Feng, J.; Yuan, B.; Zhou, J. Scene recognition with objectness. Pattern Recognit. 2018, 74, 474–487. [Google Scholar] [CrossRef]
- Meng, Z.; Yu, L.; Zhang, N.; Berg, T.L.; Damavandi, B.; Singh, V.; Bearman, A. Connecting what to say with where to look by modeling human attention traces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12679–12688. [Google Scholar]
- Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
- Gurari, D.; Zhao, Y.; Zhang, M.; Bhattacharya, N. Captioning images taken by people who are blind. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 417–434. [Google Scholar]
- Li, S.; Deng, W. Deep Facial Expression Recognition: A Survey. IEEE Trans. Affect. Comput. 2018, 13, 1195–1215. [Google Scholar] [CrossRef]
- Mahaur, B.; Singh, N.; Mishra, K.K. Road object detection: A comparative study of deep learning-based algorithms. Multimed Tools Appl. 2022, 81, 14247–14282. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the CVPR ‘14, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the ICCV, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; NanoCode012; Kwon, Y.; Michael, K.; Xie, T.; Fang, J.; imyhxy; et al. Ultralytics/yolov5: V7.0 - YOLOv5 SOTA Realtime Instance Segmentation. Zenodo, 22 November 2022. Available online: https://zenodo.org/records/7347926 (accessed on 12 August 2025). [CrossRef]
- Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. Available online: https://github.com/ultralytics/ultralytics (accessed on 30 December 2023).
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar] [CrossRef]
- He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the ICCV, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. arXiv 2020, arXiv:2005.12872. [Google Scholar]
- Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional detr for fast training convergence. In Proceedings of the ICCV, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
- Fang, Y.; Liao, B.; Wang, X.; Fang, J.; Qi, J.; Wu, R.; Niu, J.; Liu, W. You only look at one sequence: Rethinking transformer in vision through object detection. Adv. Neural Inf. Process. Syst. 2021, 34, 26183–26197. [Google Scholar]
- Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
- Nezami, O.M.; Dras, M.; Wan, S.; Paris, C. Image Captioning Using Facial Expression and Attention. J. Artif. Intell. Res. 2020, 68, 661–689. [Google Scholar] [CrossRef]
- Al-Malla, M.A.; Jafar, A.; Ghneim, N. Image captioning model using attention and object features to mimic human image understanding. J. Big Data 2022, 9, 20. [Google Scholar] [CrossRef]
- Stefanini, M.; Cornia, M.; Baraldi, L.; Cascianelli, S.; Fiameni, G.; Cucchiara, R. From show to tell: A survey on deep learning-based image captioning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 539–559. [Google Scholar] [CrossRef]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 652–663. [Google Scholar] [CrossRef]
- Ojala, T.; Pietikäinen, M.; Mäenpää, T. Gray scale and rotation invariant texture classification with local binary patterns. In Proceedings of the European Conference on Computer Vision, Dublin, Ireland, 26 June–1 July 2000; Springer: Cham, Switzerland, 2000; pp. 404–420. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar]
- Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, Pennsylvania, 27–29 July 1992; ACM: New York, NY, USA, 1992; pp. 144–152. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. arXiv 2014, arXiv:1409.0575. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G. Imagenet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, CA, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
- Zeiler, M.D.; Fergus, R. Visualizing and Understanding Convolutional Networks. arXiv 2013, arXiv:1311.2901. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar] [CrossRef]
- Kiros, R.; Salakhutdinov, R.; Zemel, R. Multimodal Neural Language Models. In Proceedings of the 31st International Conference on Machine Learning—Volume 32, Beijing, China, 21–26 June 2014; pp. 595–603. [Google Scholar]
- Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 1980, 36, 193–202. [Google Scholar] [CrossRef]
- Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K.J. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoust. Speech, Signal Process. 1989, 37, 328–339. [Google Scholar] [CrossRef]
- Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. arXiv 2021, arXiv:2101.11986. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jegou, H. Training data-efficient image transformers and distillation through attention. arXiv 2020, arXiv:2012.12877. [Google Scholar]
- Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. arXiv 2021, arXiv:2103.15808. [Google Scholar] [CrossRef]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
- Graham, B.; El-Nouby, A.; Touvron, H.; Stock, P.; Joulin, A.; Jégou, H.; Douze, M. Levit: A vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 12259–12269. [Google Scholar]
- Huang, G.; Liu, Z.; Maaten, L.V.D.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Chen, C.-F.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. arXiv 2021, arXiv:2103.14899. [Google Scholar]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
- Bao, H.B.; Dong, L.; Wei, F.R. BEiT: BERT pre-training of image transformers. arXiv 2021, arXiv:2106.08254. [Google Scholar]
- Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the ICML, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
- Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. Coatnet: Marrying Convolution and Attention for All Data Sizes. 2021. Available online: https://proceedings.neurips.cc/paper/2021/hash/20568692db622456cc42a2e853ca21f8-Abstract.html (accessed on 12 August 2025).
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Huang, Z.; Yuille, A. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). arXiv 2015, arXiv:1412.6632. [Google Scholar] [CrossRef]
- Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 677–691. [Google Scholar] [CrossRef]
- Chen, X.; Lawrence Zitnick, C. Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R.K.; Deng, L.; Dollar, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J.C.; et al. From captions to visual concepts and back. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Jia, X.; Gavves, E.; Fernando, B.; Tuytelaars, T. Guiding the Long-Short Term Memory model for Image Caption Generation. In Proceedings of the ICCV, Santiago, Chile, 7–13 December 2015. [Google Scholar]
- You, Q.; Jin, H.; Wang, Z.; Fang, C.; Luo, J. Image captioning with semantic attention. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Wu, Q.; Shen, C.; Liu, L.; Dick, A.; Hengel, A.V.D. What Value Do Explicit High Level Concepts Have in Vision to Language Problems? In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Gu, J.; Wang, G.; Cai, J.; Chen, T. An Empirical Study of Language CNN for Image Captioning. In Proceedings of the ICCV, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Chen, F.; Ji, R.; Su, J.; Wu, Y.; Wu, Y. StructCap: Structured Semantic Embedding for Image Captioning. In Proceedings of the ACM Multimedia, Goa, India, 23–27 October 2017. [Google Scholar]
- Chen, F.; Ji, R.; Sun, X.; Wu, Y.; Su, J. GroupCap: Group-based Image Captioning with Structured Relevance and Diversity Constraints. In Proceedings of the CVPR, Salt Lake City, UT, USA, 16–23 June 2018. [Google Scholar]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-critical sequence training for image captioning. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; Mei, T. Boosting image captioning with attributes. In Proceedings of the ICCV, Venice, Italy, 22–29 October 2017. [Google Scholar]
- Gan, Z.; Gan, C.; He, X.; Pu, Y.; Tran, K.; Gao, J.; Carin, L.; Deng, L. Semantic Compositional Networks for Visual Captioning. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the ICML, Lille, France, 7–9 July 2015. [Google Scholar]
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Dai, B.; Ye, D.; Lin, D. Rethinking the form of latent states in image captioning. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the ICLR, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Chen, S.; Zhao, Q. Boosted attention: Leveraging human attention for image captioning. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.-J.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73. [Google Scholar] [CrossRef]
- Ke, L.; Pei, W.; Li, R.; Shen, X.; Tai, Y.-W. Reflective Decoding Network for Image Captioning. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Qin, Y.; Du, J.; Zhang, Y.; Lu, H. Look Back and Predict Forward in Image Captioning. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Huang, L.; Wang, W.; Xia, Y.; Chen, J. Adaptively Aligned Image Captioning via Adaptive Attention Time. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 13 December 2019. [Google Scholar]
- Wang, L.; Bai, Z.; Zhang, Y.; Lu, H. Show, Recall, and Tell: Image Captioning with Recall Mechanism. In Proceedings of the AAAI, New York, NY, USA, 7–12 February 2020. [Google Scholar]
- Zha, Z.-J.; Liu, D.; Zhang, H.; Zhang, Y.; Wu, F. Context-aware visual policy network for fine-grained image captioning. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 44, 710–722. [Google Scholar] [CrossRef] [PubMed]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NeurIPS, Long Beach, CA, USA, 8 December 2017. [Google Scholar]
- Yang, X.; Zhang, H.; Cai, J. Learning to Collocate Neural Modules for Image Captioning. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Li, G.; Zhu, L.; Liu, P.; Yang, Y. Entangled Transformer for Image Captioning. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
- Bollerslev, T.; Engle, R.F.; Nelson, D.B. ARCH models. Handb. Econom. 1994, 4, 2959–3038. [Google Scholar]
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pretraining of deep bidirectional transformers for language understanding. In Proceedings of the NAACL, New Orleans, LA, USA, 1–6 June 2018. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035 (accessed on 12 August 2025).
- Herdade, S.; Kappeler, A.; Boakye, K.; Soares, J. Image Captioning: Transforming Objects into Words. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 13 December 2019. [Google Scholar]
- Guo, L.; Liu, J.; Zhu, X.; Yao, P.; Lu, S.; Lu, H. Normalized and geometry-aware self-attention network for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10327–10336. [Google Scholar]
- Luo, Y.; Ji, J.; Sun, X.; Cao, L.; Wu, Y.; Huang, F.; Lin, C.-W.; Ji, R. Dual-Level Collaborative Transformer for Image Captioning. In Proceedings of the AAAI, Online, 2–9 February 2021. [Google Scholar]
- Wang, Z.; Yu, J.; Yu, A.W.; Dai, Z.; Tsvetkov, Y.; Cao, Y. SimVLM: Simple visual language model pretraining with weak supervision. arXiv 2021, arXiv:2108.10904. [Google Scholar]
- Nan, Y.; Ju, J.; Hua, Q.; Zhang, H.; Wang, B. A-MobileNet: An approach of facial expression recognition. Alex. Eng. J. 2022, 61, 4435–4444. [Google Scholar] [CrossRef]
- Li, Z.; Zhang, T.; Jing, X.; Wang, Y. Facial expression-based analysis on emotion correlations, hotspots, and potential occurrence of urban crimes. Alex. Eng. J. 2021, 60, 1411–1420. [Google Scholar] [CrossRef]
- Mannepalli, K.; Sastry, P.N.; Suman, M. A novel adaptive fractional deep belief networks for speaker emotion recognition. Alex. Eng. J. 2017, 56, 485–497. [Google Scholar] [CrossRef]
- Tonguç, G.; Ozkara, B.O. Automatic recognition of student emotions from facial expressions during a lecture. Comput. Educ. 2020, 148, 103797. [Google Scholar] [CrossRef]
- Yun, S.S.; Choi, J.; Park, S.K.; Bong, G.Y.; Yoo, H. Social skills training for children with autism spectrum disorder using a robotic behavioral intervention system. Autism Res. 2017, 10, 1306–1323. [Google Scholar] [CrossRef]
- Li, H.; Sui, M.; Zhao, F.; Zha, Z.; Wu, F. Mvt: Mask vision transformer for facial expression recognition in the wild. arXiv 2021, arXiv:2106.04520. [Google Scholar] [CrossRef]
- Liang, X.; Xu, L.; Zhang, W.; Zhang, Y.; Liu, J.; Liu, Z. A convolution-transformer dual branch network for head-pose and occlusion facial expression recognition. Vis. Comput. 2023, 39, 2277–2290. [Google Scholar] [CrossRef]
- Jeong, M.; Ko, B.C. Driver’s facial expression recognition in real-time for safe driving. Sensors 2018, 18, 4270. [Google Scholar] [CrossRef] [PubMed]
- Kaulard, K.; Cunningham, D.W.; Bülthoff, H.H.; Wallraven, C. The MPI facial expression database—A validated database of emotional and conversational facial expressions. PLoS ONE 2012, 7, e32321. [Google Scholar] [CrossRef] [PubMed]
- Ali, M.R.; Myers, T.; Wagner, E.; Ratnu, H.; Dorsey, E.; Hoque, E. Facial expressions can detect Parkinson’s disease: Preliminary evidence from videos collected online. NPJ Digital Med. 2021, 4, 1–4. [Google Scholar] [CrossRef]
- Du, Y.; Zhang, F.; Wang, Y.; Bi, T.; Qiu, J. Perceptual learning of facial expressions. Vision Res. 2016, 128, 19–29. [Google Scholar] [CrossRef]
- Nezami, O.M.; Dras, M.; Anderson, P.; Hamey, L. Face-cap: Image captioning using facial expression analysis. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2018; pp. 226–240. [Google Scholar]
- Rieder, M.; Verbeet, R. Robot-Human-Learning for Robotic Picking Processes. 2019. Available online: https://tore.tuhh.de/entities/publication/b89d0bf5-da16-4138-a979-a8a59d75b8d5 (accessed on 12 August 2025). [CrossRef]
- Lowe, D.G. Object Recognition from Local Scale-Invariant Features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 2, pp. 1150–1157. [Google Scholar]
- Bay, H.; Tuytelaars, T.; Van Gool, L. SURF: Speeded Up Robust Features. In Proceedings of the 9th European Conference on Computer Vision—Volume Part I, ECCV’06, Graz, Austria, 7–13 May 2006; pp. 404–417. [Google Scholar]
- Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
- Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
- Hodosh, M.; Young, P.; Hockenmaier, J. Framing image description as a ranking task: Data, models and evaluation metrics. J. Artif. Intell. Res. 2013, 47, 853–899. [Google Scholar] [CrossRef]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar]
- Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3558–3568. [Google Scholar]
- Ordonez, V.; Kulkarni, G.; Berg, T.L. Im2Text: Describing Images Using 1 Million Captioned Photographs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Granada, Spain, 12–14 December 2011; pp. 1143–1151. [Google Scholar]
- Gurari, D.; Li, Q.; Stangl, A.J.; Guo, A.; Lin, C.; Grauman, K.; Luo, J.; Bigham, J.P. VizWiz Grand Challenge: Answering Visual Questions from Blind People. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3608–3617. [Google Scholar]
- Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-UCSD Birds-200–2011 Dataset; Technical Report CNS-TR-2011-001; California Institute of Technology: Pasadena, CA, USA, 2011. [Google Scholar]
- Nilsback, M.-E.; Zisserman, A. Automated Flower Classification over a Large Number of Classes. In Proceedings of the Sixth Indian Conference on Computer Vision, Graphics & Image Processing, Bhubaneswar, India, 16–19 December 2008; pp. 722–729. [Google Scholar]
- Yang, X.; Zhang, W.; Chen, H. Fashion Captioning: Towards Generating Accurate and Fashionable Descriptions. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Ramisa, A.; Yan, F.; Moreno-Noguer, F.; Mikolajczyk, K. BreakingNews: Article Annotation by Image and Text Processing. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2410–2424. [Google Scholar] [CrossRef]
- Biten, A.F.; Gomez, L.; Rusinol, M.; Karatzas, D. Good News, Everyone! Context Driven Entity-Aware Captioning for News Images. In Proceedings of the CVPR, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Sidorov, O.; Hu, R.; Rohrbach, M.; Singh, A. TextCaps: A Dataset for Image Captioning with Reading Comprehension. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 742–758. [Google Scholar]
- Pont-Tuset, J.; Uijlings, J.; Changpinyo, S.; Soricut, R.; Ferrari, V. Connecting Vision and Language with Localized Narratives. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 647–664. [Google Scholar]
- Schuhmann, C.; Vencu, R.; Beaumont, T.; Kaczmarczyk, R.; Mullis, C.; Katta, A.; Jitsev, J. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. arXiv 2021, arXiv:2111.02114. [Google Scholar]
- Schuhmann, C.; Beaumont, R.; Vencu, R.; Gordon, C.; Wightman, R.; Cherti, M.; Coombes, T.; Katta, A.; Mullis, C.; Wortsman, M.; et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv 2022, arXiv:2210.08402. [Google Scholar]
- Agrawal, H.; Anderson, P.; Desai, K.; Wang, Y.; Jain, A.; Johnson, M.; Batra, D.; Parikh, D.; Lee, S. nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8948–8957. [Google Scholar]
- Onoe, Y.; Rane, S.; Berger, Z.; Bitton, Y.; Cho, J.; Garg, R.; Ku, A.; Parekh, Z.; Pont-Tuset, J.; Tanzer, G.; et al. DOCCI: Descriptions of Connected and Contrasting Images. arXiv 2024, arXiv:2404.19753. [Google Scholar] [CrossRef]
- Pelka, O.; Koitka, S.; Rückert, J.; Nensa, F.; Friedrich, C.M. Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis; Springer: Cham, Switzerland, 2018; pp. 180–189. [Google Scholar] [CrossRef]
- Subramanian, S.; Wang, L.L.; Bogin, B.; Mehta, S.; van Zuylen, M.; Parasa, S.; Singh, S.; Gardner, M.; Hajishirzi, H. MedICaT: A Dataset of Medical Images, Captions, and Textual References. In Findings of the Association for Computational Linguistics: EMNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2112–2120. [Google Scholar] [CrossRef]
- Hsu, T.-Y.; Giles, C.L.; Huang, T.-H. SciCap: Generating Captions for Scientific Figures. In Findings of the Association for Computational Linguistics: EMNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3258–3264. [Google Scholar]
- Marin, J.; Biswas, A.; Ofli, F.; Hynes, N.; Salvador, A.; Aytar, Y.; Weber, I.; Torralba, A. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 187–203. [Google Scholar] [CrossRef]
- Lu, Y.; Guo, C.; Dai, X.; Wang, F.Y. ArtCap: A Dataset for Image Captioning of Fine Art Paintings. IEEE Trans. Comput. Soc. Syst. 2022, 11, 576–587. [Google Scholar] [CrossRef]
- Yoshikawa, Y.; Shigeto, Y.; Takeichi, A. STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, 30 July–4 August 2017; pp. 417–422. [Google Scholar]
- Thapliyal, A.V.; Pont-Tuset, J.; Chen, X.; Soricut, R. Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset. arXiv 2022, arXiv:2205.12522. [Google Scholar] [CrossRef]
- Chen, T.-S.; Siarohin, A.; Menapace, W.; Deyneka, E.; Chao, H.-W.; Jeon, B.E.; Fang, Y.; Lee, H.-Y.; Ren, J.; Yang, M.-H.; et al. Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers. arXiv 2024, arXiv:2402.19479. [Google Scholar]
- Goodfellow, I.J.; Erhan, D.; Carrier, P.L.; Courville, A.; Mirza, M.; Hamner, B.; Cukierski, W.; Tang, Y.; Thaler, D.; Lee, D.-H.; et al. Challenges in representation learning: A report on the 2013 NIPS workshop. arXiv 2013, arXiv:1307.0414. [Google Scholar]
- Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101. [Google Scholar]
- Lyons, M.J.; Kamachi, M.; Gyoba, J. Coding facial expressions with Gabor wavelets. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 200–205. [Google Scholar]
- Li, S.; Deng, W.; Du, J. Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2852–2861. [Google Scholar]
- Mollahosseini, A.; Hasani, B.; Mahoor, M.H. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 2017, 10, 18–31. [Google Scholar] [CrossRef]
- Zhang, Z.; Luo, P.; Loy, C.C.; Tang, X. From facial expression recognition to interpersonal relation prediction. Int. J. Comput. Vis. 2018, 126, 550–569. [Google Scholar] [CrossRef]
- Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Static facial expression analysis in tough conditions: Data, features and evaluation. In Proceedings of the 14th ACM International Conference on Multimodal Interaction, Santa Monica, CA, USA, 22–26 October 2012; pp. 425–432. [Google Scholar]
- Mavadati, S.M.; Mahoor, M.H.; Bartlett, K.; Trinh, P.; Cohn, J.F. DISFA: A Spontaneous Facial Action Intensity Database. IEEE Trans. Affect. Comput. 2013, 4, 151–160. [Google Scholar] [CrossRef]
- Pantic, M.; Valstar, M.; Rademaker, R.; Maat, L. Web-based database for facial expression analysis. In Proceedings of the 2005 IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, 6–9 July 2005; p. 5. [Google Scholar] [CrossRef]
- Zhang, X.; Yin, L.; Cohn, J.F.; Canavan, S.; Reale, M.; Horowitz, A.; Liu, P. A High-Resolution 3D Dynamic Facial Expression Database. In Proceedings of the 8th International Conference on Automatic Face and Gesture Recognition, Amsterdam, The Netherlands, 17–19 September 2008. [Google Scholar]
- Langner, O.; Dotsch, R.; Bijlstra, G.; Wigboldus, D.H.J.; Hawk, S.T.; van Knippenberg, A. Presentation and validation of the Radboud Faces Database. Cogn. Emot. 2010, 24, 1377–1388. [Google Scholar] [CrossRef]
- Zhao, G.; Huang, X.; Taini, M.; Li, S.Z.; Pietikäinen, M. Facial expression recognition from near-infrared videos. Image Vis. Comput. 2011, 29, 607–619. [Google Scholar] [CrossRef]
- Haq, S.; Jackson, P.J.B. Multimodal Emotion Recognition. In Machine Audition: Principles, Algorithms and Systems; Wang, W., Ed.; IGI Global Press: Hershey, PA, USA, 2010; Chapter 17; pp. 398–423. ISBN 978-1615209194. [Google Scholar]
- Lundqvist, D.; Flykt, A.; Öhman, A. The Karolinska Directed Emotional Faces (KDEF); [Data set/CD-ROM]; Karolinska Institutet, Department of Clinical Neuroscience, Psychology Section: Stockholm, Sweden, 1998; ISBN 91-630-7164-9. [Google Scholar] [CrossRef]
- Kollias, D.; Zafeiriou, S. Aff-Wild2: Extending the Aff-Wild Database for Affect Recognition. arXiv 2018, arXiv:1811.07770. [Google Scholar]
- Aneja, D.; Colburn, A.; Faigin, G.; Shapiro, L.; Mones, B. Modeling stylized character expressions via deep learning. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Cham, Switzerland, 2016; pp. 136–153. [Google Scholar]
- Ebner, N.C.; Riediger, M.; Lindenberger, U. FACES—A database of facial expressions in young, middle-aged, and older women and men: Development and validation. Behav. Res. Methods 2010, 42, 351–362. [Google Scholar] [CrossRef]
- Olszanowski, M.; Pochwatko, G.; Kuklinski, K.; Scibor-Rylski, M.; Lewinski, P.; Ohme, R.K. Warsaw set of emotional facial expression pictures: A validation study of facial display photographs. Front. Psychol. 2015, 5, 1516. [Google Scholar] [CrossRef]
- Ma, K.; Wang, X.; Yang, X.; Zhang, M.; Girard, J.M.; Morency, L.P. ElderReact: A multimodal dataset for recognizing emotional response in aging adults. In Proceedings of the 2019 International Conference on Multimodal Interaction, Suzhou, China, 14–18 October 2019; ACM: New York, NY, USA, 2019; pp. 349–357. [Google Scholar]
- Nojavanasghari, B.; Baltrušaitis, T.; Hughes, C.E.; Morency, L.P. EmoReact: A multimodal approach and dataset for recognizing emotional responses in children. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, Tokyo, Japan, 12–16 November 2016; ACM: New York, NY, USA, 2016; pp. 137–144. [Google Scholar]
- Khan, R.A.; Crenn, A.; Meyer, A.; Bouakaz, S. A novel database of children’s spontaneous facial expressions (LIRIS-CSE). Image Vis. Comput. 2019, 83, 61–69. [Google Scholar] [CrossRef]
- Zhang, L.; Walter, S.; Ma, X.; Werner, P.; Al-Hamadi, A.; Traue, H.C.; Gruss, S. “BioVid Emo DB”: A multimodal database for emotion analyses validated by subjective ratings. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016; pp. 1–6. [Google Scholar] [CrossRef]
- Saganowski, S.; Komoszyńska, J.; Behnke, M.; Perz, B.; Kunc, D.; Klich, B.; Kaczmarek, Ł.D.; Kazienko, P. Emognition dataset: Emotion recognition with self-reports, facial expressions, and physiology using wearables. Sci. Data 2022, 9, 158. [Google Scholar] [CrossRef]
- Dalrymple, K.A.; Gomez, J.; Duchaine, B. The Dartmouth Database of Children’s Faces: Acquisition and Validation of a New Face Stimulus Set. PLoS ONE 2013, 8, e79131. [Google Scholar] [CrossRef] [PubMed]
- Rizvi, S.S.A.; Seth, A.; Challa, J.S.; Narang, P. InFER++: Real-World Indian Facial Expression Dataset. IEEE Open J. Comput. Soc. 2024, 5, 406–417. [Google Scholar] [CrossRef]
- Tutuianu, G.I.; Liu, Y.; Alamäki, A.; Kauttonen, J. Benchmarking deep Facial Expression Recognition: An extensive protocol with balanced dataset in the wild. Eng. Appl. Artif. Intell. 2024, 136 Pt B, 108983. [Google Scholar] [CrossRef]
- Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. (IJCV) 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. The open images dataset v4. Int. J. Comput. Vis. (IJCV) 2020, 128, 1956–1981. [Google Scholar] [CrossRef]
- Gupta, A.; Dollar, P.; Girshick, R. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5356–5364. [Google Scholar]
- Shao, S.; Li, Z.; Zhang, T.; Peng, C.; Yu, G. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8430–8439. [Google Scholar]
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
- Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11621–11631. [Google Scholar]
- Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2446–2454. [Google Scholar]
- Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7778–7796. [Google Scholar] [CrossRef]
- Lam, D.; Kuzma, R.; McGee, K.; Doerr, S.; Lai, C. xView: Object detection in overhead imagery. arXiv 2018, arXiv:1802.07856. [Google Scholar]
- Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3462–3471. [Google Scholar] [CrossRef]
- Setio, A.A.A.; Traverso, A.; De Bel, T.; Berens, M.S.; van den Bogaard, C.; Cerello, P.; Chen, H.; Dou, Q.; Fantacci, M.E.; Geurts, B.; et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in CT images: The LUNA16 challenge. Med. Image Anal. 2017, 42, 1–13. [Google Scholar] [CrossRef]
- Bergmann, P.; Fauser, M.; Sattlegger, D.; Steger, C. MVTec AD–A comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9592–9600. [Google Scholar]
- Song, K.; Yan, Y. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Appl. Surf. Sci. 2013, 285, 858–864. [Google Scholar] [CrossRef]
- Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A diverse driving dataset for heterogeneous multitasking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8263–8272. [Google Scholar]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
- Zhou, B.; Zhao, H.; Puig, X.; Xiao, T.; Fidler, S.; Barriuso, A.; Torralba, A. Semantic Understanding of Scenes through the ADE20K Dataset. Int. J. Comput. Vis. 2019, 127, 302–321. [Google Scholar] [CrossRef]
- Yang, S.; Luo, P.; Loy, C.C.; Tang, X. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5525–5533. [Google Scholar]
- Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 87–102. [Google Scholar]
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
- Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
- Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7749–7773. [Google Scholar] [CrossRef]
- Goldman, E.; Herzig, R.; Eisenschtat, A.; Novotny, J.; Dror, T. Precise detection in densely packed scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5227–5236. [Google Scholar]
- Ge, Y.; Zhang, R.; Wang, X.; Tang, X.; Luo, J. Deepfashion2: A versatile benchmark for detection, pose estimation, segmentation, and re-identification of clothing items. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5337–5345. [Google Scholar]
- Tschandl, P.; Rosendahl, C.; Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 2018, 5, 1–9. [Google Scholar] [CrossRef]
- Priya, K.; Karthika, P.; Kaliappan, J.; Selvaraj, S.K.; Molla, N.R.B. Caption Generation Based on Emotions Using CSPDenseNet and BiLSTM with Self-Attention. Appl. Comput. Intell. Soft Comput. 2022, 13, 2756396. [Google Scholar] [CrossRef]
- Haque, N.; Labiba, I.; Akter, S. FaceGemma: Enhancing Image Captioning with Facial Attributes for Portrait Images. arXiv 2023, arXiv:2309.13601. [Google Scholar]
- Huang, F. OPCap: Object-aware Prompting Captioning. arXiv 2024, arXiv:2412.00095. [Google Scholar]
- Ye, C.; Chen, W.; Li, J.; Zhang, L.; Mao, Z. Dual-path collaborative generation network for emotional video captioning. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 496–505. [Google Scholar]
- Iwamura, K.; Louhi Kasahara, J.Y.; Moro, A.; Yamashita, A.; Asama, H. Image Captioning Using Motion-CNN with Object Detection. Sensors 2021, 21, 1270. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Kong, Z.; Li, W.; Zhang, H.; Yuan, X. FocusCap: Object-focused image captioning with CLIP-guided language model. In Proceedings of Web Information Systems and Applications: 20th International Conference, WISA 2023, Chengdu, China, 15–17 September 2023; Yuan, L., Yang, S., Li, R., Kanoulas, E., Zhao, X., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2023; Volume 14094, pp. 319–330. [Google Scholar]
- Lv, J.; Hui, T.; Zhi, Y.; Xu, Y. Infrared Image Caption Based on Object-Oriented Attention. Entropy 2023, 25, 826. [Google Scholar] [CrossRef]
- Liu, F.; Gu, L.; Shi, C.; Fu, X. Action Unit Enhance Dynamic Facial Expression Recognition. arXiv 2025, arXiv:2507.07678. [Google Scholar] [CrossRef]
- Wang, L.; Zhang, M.; Jiao, M.; Chen, E.; Ma, Y.; Wang, J. Image Captioning Method Based on CLIP-Combined Local Feature Enhancement and Multi-Scale Semantic Guidance. Electronics 2025, 14, 2809. [Google Scholar] [CrossRef]
- Kim, J.; Lee, D. Facial Expression Recognition Robust to Occlusion and to Intra-Similarity Problem Using Relevant Subsampling. Sensors 2023, 23, 2619. [Google Scholar] [CrossRef] [PubMed]
Architecture (CNN Based) | Year | Number of Layers | Approximate Number of Parameters | Architecture (Transformer Based) | Year | Number of Layers | Approximate Number of Parameters
---|---|---|---|---|---|---|---|
LeNet-5 [38] | 1998 | 7 | 60,000 | ViT (Vision Transformer) [49] | 2020 | 12–24 | 86 million (ViT-B/16), 307 million (ViT-L/16) |
AlexNet [40] | 2012 | 8 | 60 million | DeiT (Data-efficient Image Transformer) [50] | 2020 | 12–24 | 22 million (DeiT-Small), 86 million (DeiT-Base) |
VGGNet [42] | 2014 | 16–19 | 138 million (VGG16), 144 million (VGG19) | Swin Transformer [26] | 2021 | 18–48 | 28 million (Swin-Tiny), 88 million (Swin-Large) |
GoogLeNet (Inception v1) [43] | 2014 | 22 | 5 million | T2T-ViT [48] | 2021 | 14–24 | 21.5 million (T2T-ViT-14), 64 million (T2T-ViT-24) |
ResNet [44] | 2015 | 18–152 | 11.7 million (ResNet-18), 60 million (ResNet-152) | ConViT (Convolutional Vision Transformer) [51] | 2021 | 12 | 86 million (similar to ViT-B) |
Inception v3 [52] | 2015 | 48 | 23.8 million | LeViT [53] | 2021 | 12–18 | 18 million (LeViT-192), 55 million (LeViT-384) |
DenseNet [54] | 2017 | 121–201 | 8 million (DenseNet-121), 20 million (DenseNet-201) | CvT (Convolutional Vision Transformer) [51] | 2021 | 11–24 | 20 million (CvT-13), 32 million (CvT-21) |
Xception [55] | 2017 | 71 | 22.9 million | CrossViT [56] | 2021 | 18–24 | 105 million (CrossViT-18), 224 million (CrossViT-24) |
MobileNet [57] | 2017 | 28 | 4.2 million (MobileNet V1), 3.4 million (MobileNet V2) | BEiT (BERT Pre-training of Image Transformers) [58] | 2021 | 12–24 | 86 million (BEiT-Base), 307 million (BEiT-Large) |
EfficientNet [59] | 2019 | B0-B7 (scaling) | 5.3 million (B0), 66 million (B7) | CoAtNet [60] | 2021 | Varied | 25 million (CoAtNet-0), 275 million (CoAtNet-4) |
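The table above compares CNN and Transformer backbones by depth and parameter count; in a captioning pipeline either family simply serves as the visual encoder. The following is a minimal sketch, assuming pretrained torchvision backbones (ResNet-50 and ViT-B/16 chosen purely for illustration), of how a global feature vector would be pulled from each family; it is not the exact setup used by the cited works.

```python
# Minimal sketch: global visual features from a CNN or a ViT backbone, as a
# captioning encoder would use them. Model choices and the torchvision weights
# API are illustrative assumptions, not the setups of the cited works.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def cnn_features(image: Image.Image) -> torch.Tensor:
    """Global 2048-d feature from ResNet-50 with the classifier head removed."""
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()              # keep the pooled feature vector
    backbone.eval()
    with torch.no_grad():
        return backbone(preprocess(image).unsqueeze(0))   # shape [1, 2048]

def vit_features(image: Image.Image) -> torch.Tensor:
    """Global 768-d class-token feature from ViT-B/16 with the head removed."""
    backbone = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
    backbone.heads = torch.nn.Identity()           # drop the classification head
    backbone.eval()
    with torch.no_grad():
        return backbone(preprocess(image).unsqueeze(0))   # shape [1, 768]
```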
Method | Advantages | Disadvantages |
---|---|---|
SVM | Simple and efficient for binary classification | Not well-suited for multi-object detection |
R-CNN | Accurate and versatile | Computationally expensive |
Fast R-CNN | Reduced computational cost compared to R-CNN | Still two-stage process |
YOLO | Single-stage, high-speed detection | Can be less accurate than two-stage methods |
Faster R-CNN | Combines strengths of R-CNN and Fast R-CNN | More complex than YOLO |
RetinaNet | Highly accurate and efficient | Can be more computationally expensive than YOLO |
DETR (Detection Transformer) | Efficient and end-to-end | Can be less accurate on small objects |
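To ground the trade-offs listed above, the sketch below runs a pretrained two-stage detector (torchvision's Faster R-CNN) and keeps only high-confidence boxes, the kind of object branch a fault-aware captioner would consume. The weight choice and the 0.7 score threshold are illustrative assumptions, not recommendations from the cited works.

```python
# Minimal sketch: a pretrained two-stage detector (Faster R-CNN) as the object
# branch of a fault-aware captioning pipeline. The weights and the 0.7 score
# threshold are illustrative assumptions.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights
from torchvision.transforms.functional import to_tensor
from PIL import Image

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
labels = weights.meta["categories"]            # COCO class names

def detect(image_path: str, score_thr: float = 0.7):
    """Return (class_name, score, box) triples above the confidence threshold."""
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = detector([img])[0]               # dict with boxes, labels, scores
    keep = out["scores"] >= score_thr
    return [(labels[int(l)], float(s), b.tolist())
            for l, s, b in zip(out["labels"][keep], out["scores"][keep], out["boxes"][keep])]
```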
Dataset | Domain | Nb. Images | Nb. Caps (per Image) | PHM Suitability (1–5) |
---|---|---|---|---|
COCO [116] | Generic | 132K | 5 | 4/5: Strong object annotations for machinery detection; lacks emotions/faults—augment with industrial overlays. |
Flickr30K [117] | Generic | 31K | 5 | 3/5: Diverse scenes but limited to static images; moderate for operator-focused PHM. |
Flickr8K [118] | Generic | 8K | 5 | 3/5: Small scale limits generalizability; useful for basic caption training in controlled industrial tests. |
CC3M [119] | Generic | 3.3M | 1 | 4/5: Large volume for scalable training; web-sourced diversity aids variable factory scenes. |
CC12M [120] | Generic | 12.1M | 1 | 4/5: Extensive vocab for detailed fault descriptions; high potential but needs filtering for industrial relevance. |
SBU Captions [121] | Generic | 1M | 1 | 3/5: Good for general learning; limited annotations reduce utility in emotion-fault integration. |
VizWiz [122] | Assistive | 70K | 5 | 4/5: Real-world assistive focus aligns with safety monitoring; adaptable for operator-centric PHM. |
CUB-200 [123] | Birds | 12K | 10 | 2/5: Domain-specific (birds); low relevance to industrial faults or expressions. |
Oxford-102 [124] | Flowers | 8K | 10 | 2/5: Narrow domain; minimal applicability to manufacturing scenes. |
Fashion Cap. [125] | Fashion | 130K | 1 | 2/5: Fashion focused; limited for equipment/object faults in PHM. |
BreakingNews [126] | News | 115K | 1 | 3/5: Narrative style useful for descriptive captions; moderate for event-based faults. |
GoodNews [127] | News | 466K | 1 | 3/5: Large scale for training; news context aids anomaly reporting in PHM. |
TextCaps [128] | OCR | 28K | 5/6 | 3/5: Text-in-image focus; useful for reading machine labels in factories. |
Loc. Narratives [129] | Generic | 849K | 1/5 | 4/5: Long narratives for detailed fault stories; high for complex industrial descriptions. |
LAION-400M [130] | Generic | 400M | 1 | 4/5: Massive scale for robust models; diverse but unfiltered content needs curation. |
LAION-5B [131] | Generic | 5.85B | 1 | 4/5: Extreme size enables advanced training; ideal for handling PHM variability. |
Visual Genome [82] | Generic | 108K | 35 | 4/5: Rich relationships (e.g., human–object interactions) ideal for fault narratives; high potential for Industry 4.0. |
nocaps [132] | Generic | 15.1K | 11 | 3/5: Novel objects test generalization; moderate for unseen industrial anomalies. |
DOCCI [133] | Generic | 15K | 1 | 3/5: Long captions for in-depth analysis; useful but small scale limits scalability. |
VizWiz-Captions [6] | Assistive | 39K | 5 | 4/5: Builds on VizWiz; enhances assistive PHM for safety. |
ROCO/ROCOv2 [134] | Medical | 80K | 1 | 2/5: Medical domain; low direct relevance but adaptable for health-related industrial monitoring. |
MedICaT [135] | Medical | 217K | 1 | 2/5: Detailed medical captions; limited for manufacturing faults. |
SciCap [136] | Scientific | 2M | 1 | 3/5: Scientific focus aids technical descriptions in PHM. |
Recipe1M+ [137] | Food | 13M | 1 | 1/5: Food specific; negligible for industrial applications. |
ArtCap [138] | Art | 3.6K | 5 | 1/5: Art domain; low utility for fault-aware systems. |
STAIR Captions [139] | Generic (JP) | 164K | 5 | 3/5: Multilingual; moderate for global PHM but language barrier. |
Crossmodal-3600 [140] | Multilingual | 3.6K | 73 | 3/5: Multilingual support for international factories; small scale. |
Panda-70M [141] | Video | 70.8M | 1 | 4/5: Video based for dynamic faults; high for real-time PHM monitoring. |
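Several of the generic corpora above (e.g., COCO, nocaps, Visual Genome) follow the images-plus-JSON-annotations convention, so they can be iterated with off-the-shelf loaders. A minimal sketch for COCO-style captions is shown below; the local paths are placeholders, and pycocotools is assumed to be installed.

```python
# Minimal sketch: iterating over a COCO-style captioning corpus with torchvision.
# The directory and annotation paths are placeholders; pycocotools is required.
from torchvision import datasets, transforms

coco = datasets.CocoCaptions(
    root="data/coco/train2017",                                # assumed image folder
    annFile="data/coco/annotations/captions_train2017.json",   # assumed annotation file
    transform=transforms.ToTensor(),
)

image, captions = coco[0]        # one image tensor and its ~5 reference captions
print(len(coco), "images;", len(captions), "captions for the first image")
```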
Dataset | Type | Nb. Images/Videos | Nb. Subjects | PHM Suitability (1–5) |
---|---|---|---|---|
FER2013 [142] | Image | 35,887 | N/A | 3/5: Basic emotions for operator stress detection; low due to lab conditions—bias in factory lighting. |
CK+ [143] | Video | 593 seq. | 123 | 2/5: Posed expressions in lab; limited real-world variability for dynamic PHM. |
JAFFE [144] | Image | 213 | 10 | 2/5: Small, controlled; low scalability for industrial emotions. |
RAF-DB [145] | Image | 29,672 | N/A | 4/5: In-the-wild diversity; strong for operator expressions in factories. |
AffectNet [146] | Image | 1M+ | N/A | 4/5: Diverse real-world emotions; high for PHM urgency scoring (e.g., ‘alarm’ linked to faults).
ExpW [147] | Image | 91,793 | N/A | 4/5: Large wild dataset; good for handling factory pose/lighting variations. |
SFEW 2.0 [148] | Image | 1766 | N/A | 3/5: Film sourced; moderate for spontaneous industrial reactions. |
DISFA [149] | Video | 27 videos | 27 | 3/5: Action units for nuanced analysis; lab limits real-time PHM. |
MMI [150] | Both | 2900+ | 75 | 2/5: Controlled; low for wild industrial settings. |
BU-4DFE [151] | 3D/4D | 606 seq. | 101 | 3/5: 3D/4D for depth; useful for occluded factory views but lab based. |
RaFD [152] | Image | 8040 | 67 | 2/5: Posed; limited diversity for PHM. |
Oulu-CASIA [153] | Both | 2880 seq. | 80 | 2/5: Lab-focused; low for variable lighting. |
SAVEE [154] | Audio–Video | 480 | 4 | 2/5: Small, multimodal; minimal for visual-only PHM. |
KDEF [155] | Image | 4900 | 70 | 2/5: Posed angles; low real-world applicability. |
Aff-Wild2 [156] | Video | 558 videos | 458 | 4/5: Continuous emotions in wild; excellent for dynamic operator monitoring. |
FERG [157] | Synthetic | 55,767 | 6 chars | 3/5: Synthetic for augmentation; helps with data scarcity in PHM. |
FACES [158] | Image | 2052 | 171 | 2/5: Age diverse but lab; moderate for operator demographics. |
WSEFEP [159] | Image | 210 | 30 | 2/5: Small; low utility. |
ElderReact [160] | Video | 1323 clips | 30 | 2/5: Elderly focus; niche for specific PHM demographics. |
EmoReact [161] | Video | 360 videos | 63 | 2/5: Child focused; low for adult operators. |
LIRIS-CSE [162] | Video | 208 videos | 208 | 2/5: Child emotions; limited relevance. |
BioVidEmo [163] | Video | 90 videos | 90 | 3/5: Physiological links; useful for stress in PHM. |
Emognition [164] | Multimodal | 387 clips | 43 | 3/5: Multimodal; potential for sensor-fused PHM. |
DDCF [165] | Image | 6000+ | 100+ | 3/5: Diverse subjects; moderate for operator variety. |
InFER++ [166] | Image | 10,000+ | 600 | 4/5: In-the-wild; strong for factory variability. |
BTFER [167] | Image | 2800 | N/A | 4/5: Wild expressions; high for real-time alerts. |
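Many of the image-based FER corpora above are distributed, or are commonly repackaged, as one folder per emotion class. Under that assumption (the directory name and class list below are placeholders), such a corpus can be treated as a standard classification dataset, as in the sketch below.

```python
# Minimal sketch: an FER corpus repackaged as one folder per emotion class
# (a common but not universal layout) loaded as a classification dataset.
# The directory name and class set are illustrative assumptions.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

fer_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),   # many FER images are grayscale
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

fer = datasets.ImageFolder("data/fer/train", transform=fer_transform)
loader = DataLoader(fer, batch_size=64, shuffle=True, num_workers=2)

print(fer.classes)   # e.g., ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
```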
Dataset | Domain | Nb. Images/Frames | Nb. Classes | PHM Suitability (1–5) |
---|---|---|---|---|
COCO [116] | General | 330,000 | 80 | 4/5: Extensive for detecting tools/equipment; integrate with FER for fault-aware scenes. |
Pascal VOC [168] | General | 11,500 | 20 | 3/5: Foundational for object localization; limited classes for industrial machinery. |
ImageNet [169] | General | 14,200,000 | 21,841 | 4/5: Massive scale; good for diverse object training in factories. |
Open Images V7 [170] | General | 1,780,000 | 600 | 4/5: Large annotations; high for scalable PHM models. |
LVIS [171] | Long-tail | 160,000 | 1203 | 3/5: Long-tail objects; useful for rare faults but complex. |
Objects365 [172] | General | 1,740,000 | 365 | 4/5: Dense objects; strong for crowded factory scenes. |
KITTI [173] | Autonomous | 15,000 | 8 | 4/5: Dynamic scenes suit real-time PHM (e.g., vehicle faults); adaptable to factory robotics. |
nuScenes [174] | Autonomous | 1,400,000 | 23 | 4/5: Multi-sensor; high for integrated industrial monitoring. |
Waymo Open [175] | Autonomous | 390,000 | Multiple | 4/5: Advanced annotations; excellent for autonomous PHM systems. |
DOTA v2.0 [176] | Aerial | 11,300 | 18 | 3/5: Aerial views; moderate for overhead factory surveillance. |
xView [177] | Aerial | 1400 | 60 | 3/5: Satellite like; useful for large-scale infrastructure faults. |
ChestX-ray14 [178] | Medical | 112,000 | 14 | 2/5: Medical; low for manufacturing but adaptable for health safety. |
LUNA16 [179] | Medical | 888 CT | 1 | 2/5: CT scans; niche for defect detection analogies. |
MVTec AD [180] | Industrial | 5354 | 15 | 5/5: Industrial anomalies; perfect for fault detection in PHM. |
NEU-DET [181] | Industrial | 1800 | 6 | 5/5: Steel defects; directly relevant for manufacturing faults. |
BDD100K [182] | Autonomous | 100,000 | 10 | 4/5: Driving scenes; adaptable to vehicle/equipment monitoring. |
Cityscapes [183] | Urban | 25,000 | 30 | 3/5: Urban; moderate for infrastructure-related PHM. |
ADE20K [184] | Scene | 25,000 | 150 | 3/5: Scene parsing; useful for environmental context in factories. |
WIDER Face [185] | Face | 32,000 | 1 | 3/5: Face detection; supports FER integration for operators. |
MS-Celeb-1M [186] | Face | 10,000,000 | 100,000 | 3/5: Large faces; good for identity in security–PHM hybrids. |
DIOR [187] | Aerial | 23,500 | 20 | 3/5: Remote sensing; moderate for aerial industrial inspections. |
UAVDT [188] | Aerial+Video | 80,000 | 3 | 4/5: Drone videos; high for dynamic fault surveillance. |
VisDrone [189] | Aerial | 10,000 | 10 | 4/5: Drone-based; strong for overhead monitoring in PHM. |
SKU-110K [190] | Retail | 11,800 | N/A | 2/5: Product detection; low for industrial equipment. |
DeepFashion2 [191] | Fashion | 491,000 | 13 | 2/5: Clothing; minimal relevance to faults. |
HAM10000 [192] | Medical | 10,000 | 7 | 2/5: Skin lesions; niche for defect analogies in materials. |
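Of the detection datasets above, MVTec AD and NEU-DET are the most directly PHM-relevant. As an illustration only, the sketch below reads one MVTec AD category's test split, whose published layout is one folder per defect type plus a "good" folder, and derives image-level anomaly labels; the local path is a placeholder.

```python
# Minimal sketch: reading one MVTec AD category's test split (layout assumed to be
# one folder per defect type plus "good") and deriving image-level anomaly labels.
from torchvision import datasets, transforms

test_set = datasets.ImageFolder(
    "data/mvtec_ad/bottle/test",                 # assumed local copy of one category
    transform=transforms.Compose([transforms.Resize((256, 256)), transforms.ToTensor()]),
)

good_idx = test_set.class_to_idx["good"]
labels = [0 if target == good_idx else 1 for _, target in test_set.samples]  # 1 = anomalous
print(f"{len(labels)} test images, {sum(labels)} anomalous")
```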
Study/Year | Methodology | Datasets Used | Key Innovations | Applications |
---|---|---|---|---|
Face-Cap (2019) [112] | CNN (VGGNet) for FER + LSTM decoder with attention | FER2013; Flickr8k/FlickrFace11K | Emotion probabilities initialize LSTM; face loss for conditioning | Emotional description for accessibility
CSPDenseNet-based (2022) [193] | FER model + CSPDenseNet for dense features + encoder–decoder | Not specified (general image datasets) | Emotional encoding fused with dense visuals | Sentiment-aware media |
FaceGemma (2023) [194] | Multimodal fine-tuning with attribute prompts | FaceAttrib (CelebA subset) | 40 facial attributes for nuanced captions | Portrait indexing; multilingual aids |
OPCap (2024) [195] | YOLO-tiny object detection + CLIP attribute predictor + Transformer | COCO; nocaps | Object-aware prompting to reduce hallucinations | Image search; smart albums |
Dual-path EVC (2024) [196] | Dynamic emotion perception + adaptive decoder (ResNet/3D-ResNeXt) | EVC-MSVD; EVC-VE; SentiCap | Emotion evolution modules for balanced captions | Video ads; social media engagement |
Motion-CNN (2021) [197] | Faster R-CNN object detection + motion-CNN + LSTM attention | MSR-VTT2016-Image; MSCOCO | Motion features from object regions | Aiding visually impaired; indexing |
BLIP-2-based (2023) [198] | BLIP multimodal fusion + object detection + FER cues | Surveillance datasets (custom complex scenes) | Vision–text integration for scene understanding | Fault monitoring in surveillance/Industry 4.0 |
FocusCap (2023) [199] | CLIP embeddings + pre-trained LM + guided FER attention | COCO; Visual Genome | Unsupervised object-focused captioning with emotional guidance | Real-time industrial diagnostics; accessibility |
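Across these studies the common pattern is the same: detect the objects in the scene, recognize the operator's expression, and condition the caption generator on both. The following is a model-agnostic sketch of that fusion step; detect_objects, recognize_expression, and generate_caption are hypothetical placeholders for whichever detector, FER model, and captioner a deployment uses, and the escalation rule at the end is only an example of a PHM hook.

```python
# Model-agnostic sketch of the FER + object-detection fusion pattern shared by the
# surveyed systems. detect_objects(), recognize_expression() and generate_caption()
# are hypothetical placeholders, not APIs of any cited work.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SceneContext:
    objects: List[str]          # e.g., ["conveyor", "robot arm", "operator"]
    expression: str             # e.g., "fear", "neutral"
    expression_score: float     # confidence of the FER model

def fuse_for_caption(image,
                     detect_objects: Callable[[object], List[str]],
                     recognize_expression: Callable[[object], Tuple[str, float]],
                     generate_caption: Callable[[object, SceneContext], str]) -> str:
    """Condition the captioner on detected objects and the operator's expression."""
    objects = detect_objects(image)
    expression, score = recognize_expression(image)
    ctx = SceneContext(objects=objects, expression=expression, expression_score=score)
    caption = generate_caption(image, ctx)
    # Example PHM hook: escalate when a negative expression co-occurs with machinery.
    if expression in {"fear", "anger", "surprise"} and score > 0.8 and objects:
        caption += " [potential fault: operator reaction near " + objects[0] + "]"
    return caption
```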
Model | Accuracy Gains (e.g., BLEU-4/CIDEr) | Computational Cost (GFLOPs) | Suitability for Industry 4.0 (e.g., Real-Time) |
---|---|---|---|
Face-Cap (2019) [112] | +5–10% emotional metrics | Low (∼10) | Moderate; adaptable for operator monitoring |
FaceGemma (2023) [194] | METEOR +15% on attributes | Medium (∼50) | Low for real-time; better suited for offline analysis |
OPCap (2024) [195] | Hallucination reduction 15–20% | High (∼80) | High; edge devices for fault detection |
Dual-path (2024) [196] | CIDEr +10–15% | Medium (∼60) | Moderate; dynamic for PHM alerts |
BLIP-2 (2023) [198] | SPICE +12–18% in complex scenes | High (∼70) | Moderate; suitable for surveillance-based fault awareness |
FocusCap (2023) [199] | METEOR +10–15% (object-focused) | Medium (∼55) | High; zero-shot for industrial real-time |
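The accuracy gains above are reported in standard captioning metrics (BLEU-4, METEOR, CIDEr, SPICE). As a reference point for how such scores are computed, the sketch below evaluates corpus-level BLEU-4 with NLTK on toy captions; CIDEr and SPICE are normally obtained from the COCO caption evaluation toolkit and are not reproduced here.

```python
# Minimal sketch: scoring generated captions with corpus-level BLEU-4 via NLTK,
# the kind of metric behind the gains reported above. The example captions are toy data.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["an", "operator", "inspects", "a", "halted", "conveyor", "belt"]],  # one or more refs per image
]
hypotheses = [
    ["an", "alarmed", "operator", "stands", "next", "to", "a", "stopped", "conveyor"],
]

bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```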