Review

Research Progress on the Aesthetic Quality Assessment of Complex Layout Images Based on Deep Learning

School of Packaging and Materials Engineering, Hunan University of Technology, Zhuzhou 412007, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(17), 9763; https://doi.org/10.3390/app13179763
Submission received: 9 June 2023 / Revised: 26 August 2023 / Accepted: 28 August 2023 / Published: 29 August 2023
(This article belongs to the Topic Computer Vision and Image Processing)

Abstract

In the information age, a layout image is no longer a simple combination of text and graphics; it is a complex composition of text, graphics, images, and other layout elements produced through artistic design, pre-press processing, typesetting, and related steps. At present, research on aesthetic-quality assessment focuses mainly on photographic images, and the aesthetic-quality assessment of complex layout images is rarely reported. However, the design of complex layout images such as posters, packaging labels, and advertisements cannot be separated from the evaluation of aesthetic quality. This paper analyzes the layout of complex layout images and reviews both traditional and deep-learning-based methods for image layout analysis and aesthetic-quality assessment. Finally, the features, advantages, and applications of common image aesthetic-quality assessment datasets and layout-analysis datasets are compared and analyzed, and the limitations and future perspectives of the aesthetic assessment of complex layout images are discussed in relation to layout analysis and aesthetic characteristics.

1. Introduction

With the rapid development of communication technology, the mobile internet, and cloud computing, images are being disseminated and applied on the internet at an explosive rate. Image aesthetic quality assessment is therefore gaining increasing attention in fields such as image processing and computer vision [1,2]. Image aesthetic quality assessment technology enables a computer to automatically evaluate the aesthetics of an image by computing image quality and simulating human perception and sense of beauty [3]. In recent years, researchers have combined image aesthetics research with machine learning, deep learning, and neural networks to explore methods for assessing image aesthetics [4,5,6,7,8]. Because of the complexity of image layout, the aesthetic quality of complex layout images such as posters, brochures, and covers can no longer be assessed merely by extracting aesthetic features such as color, sharpness, depth of field, and texture, which makes assessing the aesthetic quality of images in visual communication design more challenging. In graphic design, the basic design elements are of four types: text, graphics, color, and layout. While the first three can be regarded as visual elements, layout design is a relatively independent design art [9]. In recent years, the demand for design has expanded across industries, especially in applied graphic design such as packaging, advertising posters, website banners, and books. Researchers have applied AI techniques to graphic design and discussed the significance and impact of this work [9,10,11].
Research on image aesthetic quality assessment was first conducted in 2004 in a joint study by Microsoft Research Asia and Tsinghua University, which used computers to automatically distinguish images taken by ordinary users from those taken by professional photographers. Current research methods fall into two main categories: methods based on hand-designed aesthetic features and methods based on deep learning [12]. The aesthetic quality of an image is mainly evaluated by extracting aesthetic features such as color, depth of field, sharpness, exposure, and texture, either manually or by machine. The most representative industrial application is Alibaba’s “Luban” system, which uses intelligent algorithms to assess the aesthetics of designed images and selects the images with the highest aesthetic quality as product posters. Image-processing software and video websites also use image aesthetic quality assessment techniques to select highly aesthetically pleasing images. As the technology continues to improve, image aesthetic quality assessment will play an important role in more and more industries.
This paper presents a theoretical study of the aesthetic quality assessment of complex layout images in terms of layout features, image composition features, and combined aesthetic features. Most current compositional features are based on techniques and rules commonly used in photography, but these rules are one-sided. The essence of composition is the organic combination of the different parts of a picture. It is therefore necessary to construct a reasonable compositional representation that retains as much image information as possible, so that it can handle complex layout images of different types and segmentations. Secondly, although deep learning brings performance improvements, it also brings high complexity: the depth and complexity of network models keep increasing in pursuit of better performance, which greatly hinders the application of these algorithms in real-world scenarios. The network structure therefore also needs to be optimized with a reasonable trade-off between performance and complexity.
Section 2 describes existing methods for layout analysis. Section 3 and Section 4 present methods for the aesthetic quality assessment of images and the datasets for layout analysis and aesthetic-quality assessment, respectively. Section 5 discusses challenges and perspectives for future research. Finally, conclusions are drawn.

2. Complex-Image Layout Analysis Methods

2.1. Analysis of Complex Layout

A layout is a “complex image” composed of text, graphics, images, tables, and other elements, mainly used in the decorative art of books, packaging, posters, etc. [13,14,15]. Layout analysis is the process of automatically analyzing, identifying, and understanding the image, text, and table information in a layout and their positional relations [16,17]. The purpose of layout analysis is to separate text areas, graphics, and background areas in an image and to analyze the positional relationships between the different areas. The process of area segmentation and geometric-relationship analysis of an image is called geometric layout analysis. A more detailed logical analysis of the image areas is called logical layout analysis; this is a very important step, because text areas must be located and segmented before text recognition. The segmentation of complex layout documents is very difficult. The complexity lies in the mixing of graphics and text, non-rectangular text areas, complex backgrounds (e.g., text overlaid with objects or graphics), mixtures of handwriting and print, and so on.
Document images and newspapers belong to the category of complex layout images. Such images typically have a plain background with text, graphics, and tables neatly arranged in the foreground, so layout analysis is relatively easy for them, as shown in Figure 1. Compared with document images and newspapers, images such as posters, brochures, and covers have more complex backgrounds and more varied foregrounds, and the text in the foreground appears in a variety of fonts, colors, sizes, orientations, and textures. This makes layout analysis of such complex layout images more challenging, as shown in Figure 2.
Layout analysis is divided into two modules: layout segmentation and layout classification. OCR technology has been developed for simple layout analysis and has gained some popularity, but as the Internet continues to develop, layout images are becoming more and more complex and new methods of layout analysis keep emerging. Complex layout analysis is much more difficult than simple layout analysis in terms of both layout segmentation and classification.

2.2. Traditional Layout Analysis Methods

Traditional layout analysis can be broadly classified into three types of method: top-down methods [19], bottom-up methods [20], and hybrid methods [21]. With the continuous development of machine learning and neural-network technology, new analysis methods, mainly based on machine learning, have also appeared in the field of layout analysis.
The top-down approach is essentially a decomposition process: starting from the page as a whole, it recursively segments the image based on global information. Its advantages include simplicity, efficiency, intuitiveness, and good results when the layout features are relatively intuitive; its drawback is that it is not well suited to analyzing complex layout images. The most classic of these methods is the projection-based approach [22], which achieves effective segmentation by studying the image projection histogram. Examples include the iterative X-Y cut projection method and the recursive bisection projection method. In the iterative X-Y cut method, the image is first projected and the best segmentation point in the x and y directions is determined so that the image is split into two parts; segmentation then continues on each part, and the process is repeated until no further segmentation position can be found. This method is fast but does not work well on more complex or skewed layouts. The recursive bisection projection method optimizes the traditional projection method by analyzing the projection results, solving the problems associated with through-column projection. Its main idea is to determine the best segmentation of the original image from the peak-and-valley distribution of the projection histogram. If the scanned page has no through-column and no obvious peaks and valleys in the projection, the region can be folded in half for projection; if it still cannot be analyzed, the projection region is reduced further until peak and valley features appear, and blocks with the same attributes are subsequently merged according to corresponding rules [23].
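As a concrete illustration of the projection-based, top-down idea, the sketch below recursively splits a binarized page at the widest whitespace valley of its row/column projection profiles. It is a minimal sketch assuming a NumPy binary image (1 = ink); the thresholds and function names are illustrative and do not reproduce any of the cited methods.

```python
import numpy as np

def xy_cut(binary, min_gap=10, min_size=20, offset=(0, 0), boxes=None):
    """Recursively split a binary page image (1 = ink) at the widest whitespace
    valley of its row/column projection profiles; returns block bounding boxes."""
    if boxes is None:
        boxes = []
    ys, xs = np.nonzero(binary)
    if ys.size == 0:
        return boxes
    # crop to the ink bounding box so page margins never become cut positions
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    binary = binary[y0:y1, x0:x1]
    top, left = offset[0] + y0, offset[1] + x0
    h, w = binary.shape
    for axis in (1, 0):                               # 1: row profile, 0: column profile
        profile = binary.sum(axis=axis)
        blank = np.where(profile == 0)[0]             # rows/columns with no ink
        if blank.size == 0:
            continue
        runs = np.split(blank, np.where(np.diff(blank) > 1)[0] + 1)
        gap = max(runs, key=len)                      # widest whitespace valley
        if len(gap) < min_gap:
            continue
        a, b = gap[0], gap[-1] + 1
        if axis == 1:                                 # horizontal cut
            xy_cut(binary[:a, :], min_gap, min_size, (top, left), boxes)
            xy_cut(binary[b:, :], min_gap, min_size, (top + b, left), boxes)
        else:                                         # vertical cut
            xy_cut(binary[:, :a], min_gap, min_size, (top, left), boxes)
            xy_cut(binary[:, b:], min_gap, min_size, (top, left + b), boxes)
        return boxes
    if h >= min_size and w >= min_size:               # no further cut possible: emit a block
        boxes.append((top, left, top + h, left + w))
    return boxes
```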
The bottom-up layout analysis method is based on image-processing ideas and starts by finding the connected domains of the layout image. First, all the low-level layout information is obtained; connected regions with the same attributes are then merged to obtain the layout segmentation results, and each segmented region is finally identified according to specific features. Its advantage is that it captures more comprehensive layout information and is suitable for complex layout structures; its disadvantages are long processing time and low efficiency. Representative algorithms include the connected-region algorithm [24], the run-length smoothing algorithm [25], and the Voronoi diagram algorithm [26]. The main idea of connected-domain-based segmentation is to first find all the connected domains in the image and then merge them into larger connected domains based on the intra- and inter-character spacing and line spacing of the text [27,28]. Yu [29] improved the traditional connected-domain method by first dilating the regions of individual characters in the image, then fuzzily integrating the image using connected-spacing statistics, and finally performing connected-region segmentation. The run-length smoothing algorithm (RLSA) [25] is a pre-processing method for layout segmentation that draws on run-length encoding; after smoothing, the layout image is divided into isolated sub-regions, and the effective classification and merging of these regions is the key to layout segmentation. Chen [30] proposed a method based on an adaptive run-length smoothing algorithm: according to the layout structure of the document image, K-means cluster analysis is used to obtain a run-length threshold suited to the layout, and run-length smoothing is then performed to find the connected regions and achieve layout segmentation. In addition, the method proposed by Fu [31] for color-print image layout segmentation based on the connected components of Chinese characters is also a bottom-up method; it achieves accurate extraction and segmentation of text in complex printed images by reconstructing the connected components of Chinese characters.
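The run-length smoothing step described above can be sketched as follows: short background runs are filled along rows and columns, the two smoothed maps are combined, and connected regions of the result become candidate blocks. This is a minimal sketch assuming NumPy/SciPy and illustrative thresholds, not the exact procedure of the cited papers.

```python
import numpy as np
from scipy import ndimage

def smear(binary, threshold):
    """Fill background (0) runs shorter than `threshold` along each row of a 0/1 image."""
    out = binary.copy()
    for row in out:                                    # each row is a view into `out`
        zeros = np.where(row == 0)[0]
        if zeros.size == 0:
            continue
        runs = np.split(zeros, np.where(np.diff(zeros) > 1)[0] + 1)
        for run in runs:
            # bridge a small gap only if it lies between ink on both sides
            if len(run) <= threshold and run[0] > 0 and run[-1] < len(row) - 1:
                row[run] = 1
    return out

def rlsa_blocks(binary, h_thresh=30, v_thresh=15):
    horiz = smear(binary, h_thresh)                    # smear along rows
    vert = smear(binary.T, v_thresh).T                 # smear along columns
    smoothed = np.logical_and(horiz, vert)             # combine the two smoothed maps
    labels, n = ndimage.label(smoothed)                # connected regions = candidate blocks
    return ndimage.find_objects(labels)                # bounding slices of each block
```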
The two methods above assume neatly arranged images as the premise of segmentation. The top-down method is faster for single-text, single-column, or multi-column layouts without graphics. For multi-text, multi-column layouts with graphics, the bottom-up method takes advantage of local features, cutting first and then merging, and can be used to analyze complex layouts. It is therefore possible to combine both methods into a hybrid layout-analysis method. Hybrid methods include texture-analysis algorithms, in which the document page is treated as an image carrying special texture information and different areas of the image exhibit different textures. Texture-analysis methods are computationally intensive and less accurate [32].
This study compares and analyzes the advantages and disadvantages of representative algorithms among the traditional layout-analysis methods, as shown in Table 1. Despite the good results achieved by traditional layout-analysis methods, their layout analysis and table handling are significantly limited by layout differences, and their generalization is flawed when dealing with document images from different scenarios.

2.3. Layout Analysis Method Based on Machine Learning

2.3.1. Layout Analysis Method Based on Support Vector Machine (SVM)

Before neural networks became popular, traditional machine-learning methods were the mainstream approach to layout-area localization and recognition. Among them, the support vector machine (SVM) is the most classical. The SVM is a sample-learning technique established by Vapnik [33] on the basis of statistical learning theory. Based on the principle of structural risk minimization, it uses kernel functions to perform a nonlinear mapping from a low-dimensional to a high-dimensional space, avoiding overfitting and improving the generalization ability of the learning machine. The SVM algorithm has gradually been applied to a wide range of tasks, from pattern recognition to character recognition, layout and image classification, and target detection, where it performs better than traditional methods and gives more satisfactory results. Wang [34] proposed a new SVM-based image-segmentation method that combines the advantages of a mean-clustering algorithm to automatically obtain training samples and then extracts image color and texture features as SVM training samples. Zhou [35] proposed an SVM-based segmentation method in which a set of sample points is selected manually: by observing changes in color features, sample points are selected at pixel peaks so that the color difference between background and target sample points is obvious, thereby simplifying the sample points and achieving fast segmentation of color images; the influence of different kernel-function parameters and sample points on the segmentation effect was also compared and analyzed. Lu [36] proposed an SVM-based complex-image-segmentation algorithm for the complex-layout-segmentation problem, which obtains a new combined feature vector by fusing phase-congruency statistical features with texture features from an improved gray-level co-occurrence matrix; the combined vectors are then used as training samples. Wu [37] proposed an SVM-based algorithm for leaf image segmentation: a small number of pixels are labeled as foreground (leaf) and background samples, an SVM classification model is built from these samples, and all image pixels are then classified by the model to segment the target from the background. Yang [38] proposed an improved SVM image-segmentation algorithm that uses the texture and color of the image as feature vectors, mainly targeting color images with complex backgrounds and unclear target contours.
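The sketch below illustrates the general SVM-segmentation recipe shared by these methods: a few pixels are labeled as foreground or background, an SVM with an RBF kernel is trained on per-pixel color features, and the whole image is then classified. It is a minimal sketch using scikit-learn with illustrative parameters; the cited methods additionally use texture and other features.

```python
import numpy as np
from sklearn.svm import SVC

def segment_with_svm(image, fg_coords, bg_coords):
    """image: H x W x 3 float array in [0, 1]; *_coords: lists of (row, col) sample pixels."""
    fg = np.array([image[r, c] for r, c in fg_coords])        # labeled foreground colors
    bg = np.array([image[r, c] for r, c in bg_coords])        # labeled background colors
    X = np.vstack([fg, bg])
    y = np.concatenate([np.ones(len(fg)), np.zeros(len(bg))])
    # the RBF kernel performs the nonlinear low-to-high-dimensional mapping
    clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, y)
    mask = clf.predict(image.reshape(-1, 3)).reshape(image.shape[:2])
    return mask.astype(bool)                                   # True = predicted foreground
```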
Table 2 compares and analyzes improved SVM-based image-segmentation methods from recent years. At this stage, the SVM algorithm is being applied ever more broadly, in areas such as pattern recognition and regression analysis. For needs arising in pattern recognition, such as image segmentation, image-edge processing, image recognition, and target detection, SVM algorithms perform better than traditional methods and give more satisfactory results. Nevertheless, the SVM algorithm still has many shortcomings: it is not well suited to multi-class problems, since the classical SVM is designed only for binary classification; the computation of the algorithm’s design matrix is difficult to implement for large-scale training samples, because it consumes a large amount of memory and runtime when the matrix order is large; and the choice of kernel functions and parameters is difficult to adapt to the segmentation of most images.

2.3.2. Layout Analysis Based on Neural Networks

Before the rise of neural networks, traditional image-segmentation methods used low-level semantic information to segment images; they run fast, have low complexity, and can preserve good edge information. However, in real scenes, some objects have complex structures with great internal variability, so traditional image-segmentation algorithms can confuse the background with the target, leading to insufficient segmentation accuracy. To solve these problems, researchers have used neural networks to classify all pixels of an image and obtain semantic segmentation results.
Based on classical convolutional classification networks (such as AlexNet, VGGNet, GoogLeNet, and ResNet), Long [39] proposed Fully Convolutional Networks (FCN), a framework for image semantic segmentation. The main idea is to replace the fully connected layers of classical classification networks with convolutional layers, so that the network can accept inputs of any size and produce dense classification outputs. Since the FCN model was proposed, image semantic segmentation has entered the era of pixel-level prediction. Li [40] proposed a complex-document-image-segmentation method based on a label pyramid network (LPN) and the deep watershed transform, which can segment document images into instance-aware regions; the backbone of the label pyramid network is an FCN. The segmentation principle is to sum the multi-task outputs of the LPN into a probability map and apply the watershed transform to it, segmenting the document image into instance-aware regions.
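The core FCN idea, replacing the fully connected classifier with convolutions and upsampling the coarse score map back to the input size, can be sketched as follows. This is a minimal PyTorch sketch with an assumed VGG16 backbone and illustrative layer sizes, not the original FCN configuration (which also uses skip connections).

```python
import torch
import torch.nn as nn
import torchvision

class TinyFCN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=None)
        self.backbone = vgg.features                  # convolutional layers only
        self.classifier = nn.Sequential(               # replaces VGG's fully connected layers
            nn.Conv2d(512, 4096, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4096, num_classes, kernel_size=1),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        scores = self.classifier(self.backbone(x))     # coarse per-class score map
        # upsample back to the input resolution for dense, pixel-level prediction
        return nn.functional.interpolate(scores, size=(h, w),
                                         mode="bilinear", align_corners=False)

# Example: TinyFCN(num_classes=5)(torch.randn(1, 3, 512, 384)).shape == (1, 5, 512, 384)
```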
Zhao [41] proposed the Pyramid Scene Parsing Network (PSPNet), based on a spatial pyramid pooling module, which provides a strong framework for pixel-level prediction and achieves good performance in scene-segmentation tasks. Zhou [42] proposed a multi-focus image-segmentation and fusion method based on PSPNet to address the problem that traditional multi-focus image-fusion methods cannot make full use of spatial-context information; the method uses PSPNet to extract the focus region of the source image and ConvCRF to optimize the region before performing multi-focus image fusion.
The Google team proposed a series of semantic-segmentation algorithms, the DeepLab series. DeepLabv1 was proposed in 2014 [43] and achieved second place in the segmentation task on the PASCAL VOC 2012 dataset; DeepLabv2, DeepLabv3, and DeepLabv3+ followed between 2017 and 2018. The two innovations of DeepLabv1 are atrous convolution and the fully connected CRF. Building on these, DeepLabv2 additionally proposed Atrous Spatial Pyramid Pooling (ASPP) [44]. DeepLabv3 further optimizes ASPP, for example by adding 1 × 1 convolution and batch normalization (BN) [45]. DeepLabv3+ adds an upsampling decoder module modeled on the U-Net structure to improve edge accuracy [46].
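The ASPP module at the heart of DeepLabv2/v3 can be sketched as parallel dilated convolutions at several rates whose outputs are concatenated and projected by a 1 × 1 convolution. The channel counts and dilation rates below are illustrative assumptions in a minimal PyTorch sketch.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch=512, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # rate 1 uses a plain 1x1 conv; larger rates use 3x3 dilated convs
                nn.Conv2d(in_ch, out_ch, kernel_size=3 if r > 1 else 1,
                          padding=r if r > 1 else 0, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),               # BN, as added in DeepLabv3
                nn.ReLU(inplace=True))
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        # parallel dilated branches capture multi-scale context at the same resolution
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

# Example: ASPP()(torch.randn(1, 512, 32, 32)).shape == (1, 256, 32, 32)
```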
Fu [47] addressed the scene-segmentation task by capturing rich contextual dependencies with a self-attention mechanism and proposed the Dual Attention Network (DANet), which adaptively integrates local features and global dependencies. Two types of attention modules are added to the traditional dilated FCN to model semantic interdependencies in the spatial and channel dimensions, respectively.
Fan [48] proposed a novel and efficient Short-Term Dense Concatenate (STDC) network architecture, which forms the basic module of the network by gradually reducing the dimension of the feature maps and aggregating them for image representation. A detail-aggregation module is proposed in the decoder to integrate the learning of spatial information into the lower layers in a single-stream manner. Finally, shallow and deep features are fused to predict the final segmentation result.
To perform high-quality real-time semantic segmentation, Wu [49] proposed the Feature Pyramid Aggregation Network (FPANet). This network can be regarded as an encoder–decoder model: in the encoder stage, ResNet and Atrous Spatial Pyramid Pooling (ASPP) are used to extract higher-level semantic information; in the decoder stage, a bilaterally guided feature pyramid network for semantic segmentation is proposed to acquire the semantic and spatial information of the image simultaneously and to fuse features at different levels.
Tang [50] proposed DECANet, an image semantic-segmentation method that introduces a channel-attention module to model the dependencies among all channels, improving the expressive ability of the network by selectively learning and strengthening channel features, and that uses the Atrous Spatial Pyramid Pooling (ASPP) structure. Multi-scale fusion of the extracted convolutional features reduces the loss of image detail, and semantic pixel-position information can be extracted without changing the weight parameters, accelerating the convergence of the model. Table 3 compares the above neural-network-based layout-analysis methods.

3. Methods for Assessing the Aesthetic Quality of Images

Current research on image aesthetic quality assessment can be summarized into five tasks: aesthetic classification, aesthetic scoring, aesthetic distribution, aesthetic factors, and aesthetic description. Aesthetic classification judges an image as “good” or “bad”, i.e., as high or low in aesthetic quality. Aesthetic scoring judges the aesthetic quality of an image as a continuous value. The aesthetic distribution is a histogram giving the distribution of aesthetic-quality scores for an image. Aesthetic factors evaluate the image in terms of color scheme, composition, balance, depth of field, and many other aspects. An aesthetic description describes an image using linguistic comments on its aesthetics. The mainstream technology currently used for image aesthetic evaluation is deep neural networks, whose performance far exceeds that of traditional hand-designed aesthetic features.
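The first three task formulations are closely related: an aesthetic distribution over discrete scores yields an aesthetic score (its mean) and an aesthetic class (by thresholding that mean). The short example below assumes a 1–10 voting scale and a threshold of 5, as is common on AVA-style data; the vote counts are made up for illustration.

```python
import numpy as np

votes = np.array([0, 1, 2, 8, 20, 35, 25, 7, 2, 0])      # votes for scores 1..10
distribution = votes / votes.sum()                         # aesthetic distribution
score = float(np.dot(np.arange(1, 11), distribution))      # aesthetic score (mean opinion)
label = "high" if score > 5.0 else "low"                   # aesthetic classification
print(f"score = {score:.2f}, class = {label}")
```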

3.1. Traditional Methods of Assessing the Aesthetic Quality of Images

Most research on non-deep-learning-based image aesthetic-quality assessment has focused on extracting aesthetic features of images, mainly by manually constructing features and then feeding them to classifiers or regressors. Datta [51] was the first to use image features to achieve a quantitative evaluation of image aesthetic quality. Although non-deep-learning methods have achieved some results and have demonstrated the feasibility and effectiveness of computational aesthetics, the relevant features still have major shortcomings in representing the aesthetics of images. Several non-deep-learning-based aesthetic-evaluation methods from recent years are surveyed below.
In the study of image aesthetic-quality assessment, researchers first investigated both image aesthetic features such as compositional rules, the rule of thirds, and depth of field [52,53] and perceptual features such as color, edges, texture, and semantics [54,55] to achieve aesthetic-quality assessment [2,56]. Wong [53] established a visual-salient-region model based on visual perception and proposed an image aesthetic-quality-assessment method based on salient-region enhancement. Some researchers applied image-processing techniques to extract a series of low-level image features that describe the aesthetic properties of images, and built aesthetic-assessment models on them [51,57,58,59,60]. Bhattacharya [57] used the color attributes, energy distribution, and structural and edge properties of images as aesthetic features and predicted the aesthetic quality of images by training multi-class prediction models. Aydin [59] selected sharpness, depth, subject clarity, tone, and color as image aesthetic features and used aesthetic regression to evaluate the aesthetic quality of images. To obtain image features more closely related to image aesthetics, researchers evaluated image aesthetics from the perspective of composition and proposed a series of evaluation methods based on composition rules [61,62,63,64,65,66]. Tang [54] used different methods to extract image subject areas and backgrounds with different semantic content, obtaining global and regional image features that were fused into comprehensive aesthetic features to train SVMs for image aesthetic-quality classification and evaluation. Building on these methods, researchers also proposed the use of generic image features for image aesthetic-quality evaluation [67,68,69,70,71,72]. Marchesotti [68,69,70] trained a Gaussian mixture model to simulate the local feature distribution, using scale-invariant feature transform (SIFT) descriptors and color descriptors as local descriptors. Local features are encoded using a visual bag-of-words [71] as well as Fisher vectors, and the encoded features are concatenated via spatial pyramids to obtain image aesthetic attributes.
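A minimal sketch of this hand-crafted-feature pipeline is shown below: a few global cues (brightness, saturation, edge energy, a simple rule-of-thirds cue) are extracted and fed to an SVM classifier. The specific features and the scikit-image/scikit-learn calls are illustrative assumptions, not the feature sets of the cited papers.

```python
import numpy as np
from skimage import color, filters
from sklearn.svm import SVC

def aesthetic_features(rgb):
    """rgb: H x W x 3 float image in [0, 1] -> small hand-crafted feature vector."""
    hsv = color.rgb2hsv(rgb)
    gray = color.rgb2gray(rgb)
    h, w = gray.shape
    center = gray[h // 3: 2 * h // 3, w // 3: 2 * w // 3]   # central rule-of-thirds region
    return np.array([
        hsv[..., 2].mean(),                   # overall brightness
        hsv[..., 1].mean(),                   # mean saturation (colorfulness proxy)
        filters.sobel(gray).mean(),           # edge energy as a sharpness proxy
        center.std() / (gray.std() + 1e-6),   # contrast concentrated near the center
    ])

def train_classifier(images, labels):
    """images: list of RGB arrays; labels: 'high'/'low' aesthetic labels."""
    X = np.stack([aesthetic_features(im) for im in images])
    return SVC(kernel="rbf").fit(X, labels)
```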
The above methods and theories have achieved certain results, but the relevant features are still deficient in representing image aesthetics, and the inquiry into the essence of aesthetics is not deep enough to represent it adequately. Evaluation methods based on composition rules also have deficiencies, so the above methods still have major limitations. In evaluating image aesthetic quality, researchers have therefore been searching for methods that go beyond the traditional ones and are more closely related to the aesthetic characteristics of images. In recent years, with the gradual maturation of deep learning and neural networks, researchers have introduced them into image aesthetic-quality evaluation and achieved a series of results.

3.2. Deep-Learning-Based Method for Assessing the Aesthetic Quality of Images

As early as 1998, LeCun [73] proposed the LeNet-5 network, which applied the error back-propagation algorithm to the training of a neural-network architecture and formed the prototype of contemporary convolutional neural networks. Its demonstrated good performance has led more and more researchers to apply deep-learning methods to image-processing problems. As image aesthetic-quality assessment has been studied intensively, researchers have started to feed images directly into neural networks to train their own image aesthetic-quality-assessment models. Several basic deep-learning-based approaches to image-aesthetic assessment are described below.

3.2.1. Image Aesthetic-Assessment Method Based on Depth-Feature Extraction

Research on extracting image features for the assessment of image aesthetic quality can be traced back to Datta [51] in 2006, who used 56 visual features, including luminance, hue, and saturation, for image aesthetic assessment. Zhang [74] proposed a hierarchical feature-fusion model for aesthetic assessment in 2019; it is a two-stream convolutional neural network consisting of two branches with heterogeneous and complementary aesthetic-perception capabilities, designed to learn the mapping from deep image representations to their true aesthetic labels in an end-to-end manner. Li [75] designed a two-stream network that computes the aesthetic quality of an image, improving on SE-ResNet-50, together with five traditional aesthetic-feature-extraction algorithms for luminance, color harmony, the rule of thirds, and other cues. In 2021, Jang [76] analyzed the features learned by deep models for aesthetic assessment from three different perspectives (image classification, aesthetic mean, and standard-deviation classification) for different tasks.
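A two-stream design of this kind can be sketched as two backbones whose feature vectors are concatenated and mapped to an aesthetic prediction. The backbones, dimensions, and single-score output below are illustrative assumptions in a minimal PyTorch sketch, not the architecture of any cited model.

```python
import torch
import torch.nn as nn
import torchvision

class TwoStreamAesthetics(nn.Module):
    def __init__(self):
        super().__init__()
        self.global_stream = torchvision.models.resnet18(weights=None)
        self.global_stream.fc = nn.Identity()           # 512-d feature of the whole image
        self.local_stream = torchvision.models.resnet18(weights=None)
        self.local_stream.fc = nn.Identity()            # 512-d feature of a crop/patch
        self.head = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, full_image, patch):
        # concatenate the complementary features from the two branches
        fused = torch.cat([self.global_stream(full_image),
                           self.local_stream(patch)], dim=1)
        return self.head(fused)                          # predicted aesthetic score
```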

3.2.2. Aesthetic Assessment Method of Images Based on Multi-Task Convolutional Networks

Most aesthetic-quality-assessment methods can only output one type of evaluation result, which limits their scope and application scenarios. To address this, Li [77] proposed an end-to-end personality-driven multi-task deep-learning model in 2019: image aesthetics and personality traits are learned by the multi-task model, and the personality traits are then used to modulate the aesthetic traits to produce the best generic image-aesthetics score. The framework of this personality-assisted multi-task model for generic and personalized image-aesthetic assessment is shown in Figure 3. In 2020, Liu [78] proposed an end-to-end multi-task framework called the Aesthetics-Based Saliency Network (ABSNet). In 2021, Tian [79] used a multi-task residual-network model to extract image features and then evaluated the aesthetic level of images based on the extracted features. Chen [80] used a sentiment-assisted multi-task learning-network approach, building on scene- and object-branch-based aesthetic assessment, to improve the performance of image aesthetic-quality assessment.
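The multi-task pattern shared by these works can be sketched as a common backbone with one head per task, trained with a weighted sum of the task losses. The tasks (an aesthetic-score head and an auxiliary attribute head), dimensions, and loss weights below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class MultiTaskAesthetics(nn.Module):
    def __init__(self, num_attributes=5):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.backbone = backbone                          # shared 512-d representation
        self.score_head = nn.Linear(512, 1)               # aesthetic score (regression)
        self.attr_head = nn.Linear(512, num_attributes)   # auxiliary attributes (multi-label)

    def forward(self, x):
        feat = self.backbone(x)
        return self.score_head(feat), self.attr_head(feat)

def multitask_loss(score_pred, score_gt, attr_pred, attr_gt, alpha=0.5):
    # weighted sum of the per-task losses; alpha balances the auxiliary task
    return (nn.functional.mse_loss(score_pred.squeeze(-1), score_gt)
            + alpha * nn.functional.binary_cross_entropy_with_logits(attr_pred, attr_gt))
```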

3.2.3. Image Aesthetic-Assessment Method Based on Fine-Tuned Convolutional Neural Network

Image aesthetic-quality assessment is a very challenging task owing to its subjective and conceptual nature. Wang [81] suggested adapting convolutional neural networks to image aesthetic assessment, using pre-trained models and calibrating them to assess image aesthetic quality: AlexNet and VGG were fine-tuned to provide two types of outputs, with VGG, the deeper network, offering higher accuracy, and both global and local views available to train the network [82]. To address the problems of small-scale data and of quantifying the aesthetic quality of images, Li [83] proposed an image-content-based convolutional-neural-network embedding fine-tuning method to assess the aesthetic quality of images. The method uses image content to train aesthetic-quality-classification models, but the training samples per content category become smaller, and a single fine-tuning cannot make full use of small-scale datasets; the researchers therefore proposed two consecutive fine-tunings on aesthetic-quality labels and content labels. In 2022, Xu [84] of the Communication University of China fine-tuned the traditional convolutional neural network VGG16 by treating image aesthetic assessment as an aesthetic-score-assignment task, replacing the fully connected layers with convolutional layers and adding an adaptive spatial pooling layer at the end of the network to circumvent the input-size limitation of the traditional network.
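The fine-tuning strategy described above can be sketched as follows: a pre-trained VGG16 backbone is frozen, its fully connected classifier is replaced with convolutions, and an adaptive pooling layer removes the fixed-input-size constraint. The layer sizes and the 10-bin score output are assumptions for illustration, not the exact model of the cited work.

```python
import torch
import torch.nn as nn
import torchvision

def build_finetuned_vgg(num_bins=10):
    vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT)
    for p in vgg.features.parameters():
        p.requires_grad = False                        # freeze the pre-trained backbone first
    return nn.Sequential(
        vgg.features,                                  # convolutional feature extractor
        nn.Conv2d(512, 256, kernel_size=1), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),                       # accepts inputs of any size
        nn.Flatten(),
        nn.Linear(256, num_bins),                      # e.g. a distribution over score bins
    )

# Example: build_finetuned_vgg()(torch.randn(2, 3, 300, 450)).shape == (2, 10)
```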

3.2.4. Aesthetic Assessment Methods for Images in Brain-Inspired Deep Networks

Inspired by ongoing advances in human visual perception and neuroaesthetics, Wang [85] designed a brain-inspired deep network that first learns attributes through parallel supervised paths, followed by a high-level synthesis network trained to associate and translate these attributes into overall aesthetic ratings, as shown in Figure 4. Lemarchand [86] took a new approach, extracting low-level features inspired by the human visual system and using them to train machine-learning-based systems that classify visual information according to its aesthetics, regardless of the type of visual media. Extensive tests were developed to highlight the strengths and weaknesses of these low-level features while establishing good practice in the field of computational-aesthetics research.

3.2.5. Image Aesthetics-Assessment Method Based on Semi-Supervised Adversarial Learning

Traditional methods for image aesthetic assessment suffer from inefficiency, inherent labeling noise, and the incompleteness of semantically described images. Liu [87] proposed a new semi-supervised deep active-learning algorithm. The algorithm learns human gaze-movement paths hierarchically by sequentially linking semantically significant object blocks in each scene, unifying the discovery of semantically significant regions and deep gaze-path feature learning in a principled framework that requires only a small set of labeled images. Finally, a probabilistic model for image aesthetic evaluation is built on the deeply learned gaze-path features. The aesthetic assessment of an image is related to its style, semantics, and content. Xiang [88] therefore designed a multi-task network in which the aesthetic column serves as a score predictor and the style column as a style classifier; an angular-softmax loss is used to train the style classifier to maximize the margins between classes in the single-label training data, and a semi-supervised approach is used to iteratively improve the network’s generalization ability. In 2021, Shu [89] proposed a semi-supervised adversarial learning approach to alleviate the problem that annotating aesthetic attributes is time-consuming, expensive, and error-prone, so that available images are only partially annotated with attributes; the method performs image aesthetic evaluation from partially attribute-annotated images and effectively reduces the reliance on manual attribute annotations.

3.2.6. Aesthetic Assessment Method of Images Based on Multimodal Attention Networks

In 2020, Zhang [90] proposed a multimodal self-collaborative attention network to address challenges such as the heavy reliance on convolution to extract visual features for image aesthetic assessment and the difficulty of capturing the spatial interaction of visual elements in image composition. The self-attention module computes the response of a location by attending to all locations in the image, thus enabling spatial interaction of visual elements. To model complex image–text feature relationships, a co-attention module jointly performs text-guided visual attention and visually guided textual attention, and the attended multimodal features are then aggregated and fed to a two-layer multilayer perceptron (MLP) to predict the aesthetic value. In 2021, Miao [91] proposed an end-to-end multi-output deep-learning model based on a multimodal graph convolutional network for joint image aesthetic and sentiment analysis with joint attention. In this model, a stacked multimodal graph convolutional network encodes features guided by a correlation matrix, and a joint attention module is designed to help image aesthetic and sentiment feature representations learn from each other. Liu [92] divided aesthetic assessment into multiple modalities, evaluated each modality separately and more accurately, and finally combined the evaluations of the various modalities into an overall feature vector, proposing an aesthetic-assessment method based on multimodal fusion.
The mainstream techniques for image aesthetic-quality assessment at this stage are still based on deep learning. Powerful feature representations learned from large amounts of data have shown ever-increasing performance on recognition, localization, retrieval, and tracking tasks, exceeding the capabilities of traditional hand-crafted features. However, deep-learning-based image aesthetic-assessment methods have poor interpretability, whereas non-deep-learning methods are more interpretable. Non-deep-learning approaches rely mainly on low-level features that do not take the semantic information of the image into account and provide only a limited range of aesthetic grades. Deep-learning techniques provide better accuracy than non-deep-learning techniques because they consider a broader picture of the image, including both low-level and high-level features. In addition, deep-learning methods require large amounts of data for model training, and these datasets play a far more important role than in non-deep-learning techniques. Deep-learning techniques also require more computational resources and time for training and deploying models.

4. Datasets

For datasets for image aesthetic-quality assessment and layout analysis, a large number of relevant benchmark datasets have been open-sourced by academia and industry. This has greatly motivated researchers in related fields to build new algorithmic models, and, in particular, current deep neural network-based models have performed well on these datasets.

4.1. Layout-Analysis Dataset

Most of the current layout-analysis tasks are carried out with document images as the research objects, and therefore most of the datasets proposed by researchers are also specific to document layout images. Table 4 shows the presentation and comparison of a range of document-layout-analysis datasets proposed by researchers in recent years.
PubLayNet is a high-quality document-layout dataset generated by automatically annotating the document layouts of PubMed Central™ PDF articles; it covers typical document layout elements and is suitable for training models that recognize scientific document layouts. DocBank is a large-scale dataset built using a weakly supervised approach, enabling document-layout-analysis models to integrate textual and visual information for downstream tasks. The IIIT-AR-13K dataset is very effective as training data for detecting graphical objects in business documents and technical articles. ReadingBank, a document-image dataset containing reading-order, text, and layout information, is the first large-scale dataset of its kind and enables neural networks to be applied to reading-order detection. TNCR is a dataset for analyzing tables in images, which can support the study of table detection, structure recognition, and classification in document-image layouts. NewsNet7 is a real-world newspaper-image dataset primarily used to analyze various complex layouts in document images. LIE is an information-extraction dataset that focuses on extracting structural and semantic knowledge from visually rich documents.
These datasets have been constructed primarily for the automatic detection of layout elements such as tables, text, captions, and graphics in document images, and have contributed significantly to the field of document-image layout analysis. Nevertheless, these datasets still have some shortcomings. Some of the currently available datasets do not take into account the relationships among the layout elements being analyzed. The amount of data in some datasets is relatively small, making it difficult to provide enough training samples for large-scale deep-neural-network models, among other problems.

4.2. Image Aesthetic-Quality-Assessment Dataset

Commonly used benchmark datasets for image aesthetic-quality assessment include AVA, AADB, Photo.net, etc. In recent years, researchers have proposed various datasets for image aesthetic-quality assessment, and Table 5 shows the specific introduction to and comparison of each dataset.
The Photo.net and AROD datasets were collected from online image-sharing and rating websites and have high-quality data labeling. The AVA, AADB, PCCD, IDEA, and TAD66K datasets were collected using manual scoring. Among these, the AVA dataset has high-quality labels and supports the learning of aesthetic classification, aesthetic scoring, and aesthetic distribution, but it does not take into account the effects of the shooting scene, camera parameters, and post-processing. The AADB dataset contains dichotomous ratings of eight aesthetic factors (each factor rated as “good” or “bad”), but these ratings are too simplistic to analyze the subjectivity and diversity of aesthetic ratings. The PCCD dataset is the first image-aesthetics dataset to include linguistic comments on multiple aesthetic factors. The PCCD, EVA, IDEA, and AMD-A datasets contain too little data to meet the training-sample requirements of large deep neural networks. The AVA-Reviews dataset contains 40,000 images from the AVA dataset, each with six linguistic comments; however, it is still small, and its comment annotations do not take multiple aesthetic factors into account.

5. Summary and Future Prospects

Despite the highly competitive performance reported in the literature above, research in this area is far from saturated, and challenging questions continue to emerge as researchers delve deeper into the field. This paper first analyzed complex layouts and, on that basis, reviewed layout-analysis methods. It then described the research progress in image aesthetic-quality evaluation. Finally, the commonly used datasets for layout analysis and image aesthetic-quality evaluation were introduced and analyzed; in future work, we will construct a dataset suitable for this study based on the existing ones. The existing research ideas and technical solutions are summarized and analyzed, and an outlook on future directions for this research area is given below.
(1)
Building a visual-communication-design-class image dataset
Most current open-source complex-layout-analysis datasets target complex layout images such as documents and newspapers, and most image aesthetic-quality-assessment datasets consist of images without text, tables, or other layout elements. To make the aesthetic-quality assessment of complex layout images more accurate and targeted, this paper proposes building visual-communication-design image datasets of complex layout images that contain text, graphics, images, and other elements, such as posters, covers, and brochures.
(2)
Modular-design network-model structure
Many neural-network models for complex layout analysis now achieve good results, but they are applied to documents and newspapers with plain backgrounds; layout-analysis models for visual-communication-design images are rarely reported, so such models are a promising direction for future layout-analysis research. Visual-communication-design images also inevitably need aesthetic-quality assessment to judge the quality and aesthetics of the design, so this is likewise a future direction for research on image aesthetic-quality assessment. A modular design of the network-model structure for these two tasks can greatly reduce training time, and the scope of each module can be determined by the model itself.

Author Contributions

Investigation, Y.P.; methodology, Y.P. and Y.Z.; supervision, D.L.; validation, D.L. and S.C.; resources Y.Z.; writing—original draft, Y.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hunan Province; Project Grant No. 2021JJ30218.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Deng, Y.; Loy, C.C.; Tang, X. Image aesthetic assessment: An experimental survey. IEEE Signal Process. Mag. 2017, 34, 80–106. [Google Scholar] [CrossRef]
  2. Luo, P. Social image aesthetic classification and optimization algorithm in machine learning. Neural Comput. Appl. 2023, 35, 4283–4293. [Google Scholar] [CrossRef]
  3. Lu, X.; Lin, Z.; Jin, H.; Yang, J.; Wang, J.Z. Rating image aesthetics using deep learning. IEEE Trans. Multimed. 2015, 17, 2021–2034. [Google Scholar] [CrossRef]
  4. Yang, J.; Zhou, Y.; Zhao, Y.; Lu, W.; Gao, X. MetaMP: Metalearning-Based Multipatch Image Aesthetics Assessment. IEEE Trans. Cybern. 2022, 53, 5716–5728. [Google Scholar] [CrossRef] [PubMed]
  5. Wang, Z.; Liu, D.; Chang, S.; Dolcos, F.; Beck, D.; Huang, T. Image aesthetics assessment using Deep Chatterjee’s machine. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 941–948. [Google Scholar]
  6. Kao, Y.; He, R.; Huang, K. Deep aesthetic quality assessment with semantic information. IEEE Trans. Image Process. 2017, 26, 1482–1495. [Google Scholar] [CrossRef]
  7. Zhang, X.; Gao, X.; Lu, W.; He, L. A gated peripheral-foveal convolutional neural network for unified image aesthetic prediction. IEEE Trans. Multimed. 2019, 21, 2815–2826. [Google Scholar] [CrossRef]
  8. Apostolidis, K.; Mezaris, V. Image aesthetics assessment using fully convolutional neural networks. In Proceedings of the MultiMedia Modeling: 25th International Conference, Thessaloniki, Greece, 8–11 January 2019; pp. 361–373. [Google Scholar]
  9. Tan, H.; Xu, B.; Liu, A. Research and Extraction on Intelligent Generation Rules of Posters in Graphic Design. In Proceedings of the Cross-Cultural Design. Methods, Tools and User Experience: 11th International Conference, Orlando, FL, USA, 26–31 July 2019; pp. 570–582. [Google Scholar]
  10. Guo, S.; Jin, Z.; Sun, F.; Li, J.; Li, Z.; Shi, Y.; Cao, N. Vinci: An intelligent graphic design system for generating advertising posters. In Proceedings of the CHI Conference on Human Factors in Computing Systems, Virtual (originally Yokohama, Japan), 8–13 May 2021; pp. 1–17. [Google Scholar]
  11. Huo, H.; Wang, F. A Study of Artificial Intelligence-Based Poster Layout Design in Visual Communication. Sci. Program. 2022, 2022, 1191073. [Google Scholar] [CrossRef]
  12. Yang, H.; Shi, P.; He, S.; Pan, D.; Ying, Z.; Lei, L. A comprehensive survey on image aesthetic quality assessment. In Proceedings of the IEEE/ACIS 18th International Conference on Computer and Information Science (ICIS), Beijing, China, 17–19 June 2019; pp. 294–299. [Google Scholar]
  13. Zhang, Y. Layout analysis and understanding. Appl. Linguist. 1997, 2, 94–100. [Google Scholar]
  14. Binmakhashen, G.M.; Mahmoud, S.A. Document layout analysis: A comprehensive survey. ACM Comput. Surv. 2019, 52, 1–36. [Google Scholar] [CrossRef]
  15. Namboodiri, A.M.; Jain, A.K. Document structure and layout analysis. In Digital Document Processing: Major Directions and Recent Advances; Springer: Berlin/Heidelberg, Germany, 2007; pp. 29–48. [Google Scholar]
  16. O’Gorman, L. The document spectrum for page layout analysis. IEEE Trans. Pattern Anal. 1993, 15, 1162–1173. [Google Scholar] [CrossRef]
  17. Ittner, D.J.; Baird, H.S. Language-free layout analysis. In Proceedings of the 2nd International Conference on Document Analysis and Recognition (ICDAR’93), Tsukuba Science City, Japan, 20–22 October 1993; pp. 336–340. [Google Scholar]
  18. Zhong, X.; Tang, J.; Yepes, A.J. Publaynet: Largest dataset ever for document layout analysis. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Sydney International Convention Centre, Sydney, Australia, 20–25 September 2019; pp. 1015–1022. [Google Scholar]
  19. Nagy, G.; Seth, S.C. Hierarchical representation of optically scanned documents. In Proceedings of the 7th International Conference on Pattern Recognition (ICPR), Montréal, QC, Canada, 30 July–2 August 1984. [Google Scholar]
  20. Mao, S.; Rosenfeld, A.; Kanungo, T. Document structure analysis algorithms: A literature survey. Doc. Recognit. Retr. X 2003, 5010, 197–207. [Google Scholar]
  21. Ha, J.; Haralick, R.M.; Phillips, I.T. Document page decomposition by the bounding-box project. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 2, pp. 1119–1122. [Google Scholar]
  22. Zhu, W.; Chen, Q.; Wei, C.; Li, Z. A segmentation algorithm based on image projection for complex text layout. AIP Conf. Proc. 2017, 1890, 030011. [Google Scholar]
  23. Wei, C.; Chen, Q.; Zhang, M. Research on Document Image Layout Segmentation Algorithm Based on Projection. Mod. Comput. 2016, 10, 33–38. [Google Scholar]
  24. Zhan, Y.; Wang, W.; Gao, W. A robust split-and-merge text segmentation approach for images. In Proceedings of the 18th International Conference on Pattern Recognition, Hong Kong, China, 20–24 August 2006; Volume 2, pp. 1002–1005. [Google Scholar]
  25. Strouthopoulos, C.; Papamarkos, N.; Chamzas, C. PLA using RLSA and a neural network. Eng. Appl. Artif. Intell. 1999, 12, 119–138. [Google Scholar] [CrossRef]
  26. Lu, Y.; Tan, C.L. Constructing area Voronoi diagram in document images. In Proceedings of the Eighth International Conference on Document Analysis and Recognition (ICDAR’05), Seoul, Republic of Korea, 29 August–1 September 2005; pp. 342–346. [Google Scholar]
  27. Xiao, F.; Xiao, L. A Chinese document layout analysis based on non-text images. In Proceedings of the 2009 International Forum on Computer Science-Technology and Applications, Chongqing, China, 25 December 2009; Volume 1, pp. 326–328. [Google Scholar]
  28. Guo, L.; Sun, X.; Wang, Z.; Yang, J. A Connectivity-based Page Segmentation Method. Comput. Eng. Appl. 2003, 05, 105–107. [Google Scholar]
  29. Yu, M.; Guo, Q.; Wang, D.; Yu, Y. Improved connectivity-based layout segmentation method. Comput. Eng. Appl. 2013, 49, 195–198. [Google Scholar]
  30. Chen, Y.; Wang, W.; Liu, H.; Cai, Z.; Zhao, P. Layout segmentation and description of Tibetan document images based on adaptive run length smoothing algorithm. Laser Optoelectron. Prog. 2021, 58, 172–179. [Google Scholar]
  31. Fu, L.; Qian, J.; Zhong, Y. Printed image layout segmentation method based on Chinese character connected component. Comput. Eng. Appl. 2015, 51, 178–182. [Google Scholar]
  32. Zujovic, J.; Pappas, T.N.; Neuhoff, D.L. Structural similarity metrics for texture analysis and retrieval. In Proceedings of the 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 2225–2228. [Google Scholar]
  33. Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999. [Google Scholar] [CrossRef]
  34. Wang, Y.; Lu, Y.; Li, Y. A new image segmentation method based on support vector machine. In Proceedings of the IEEE 4th International Conference on Image, Vision and Computing (ICIVC), Xiamen, China, 5–7 July 2019; pp. 177–181. [Google Scholar]
  35. Zhou, K.; Qiao, X.; Li, F. Research on color image segmentation based on support vector machine. Mod. Electron. Tech. 2019, 42, 103–106+111. [Google Scholar]
  36. Lu, Y.; Fang, J.; Zhang, S.; Liu, C. Research on layout segmentation based on support vector machine. Mod. Electron. Tech. 2020, 43, 149–153. [Google Scholar]
  37. Wu, Z.; Wang, Q. Leaf image segmentation based on support vector machine. Softw. Eng. 2022, 6, 25. [Google Scholar]
  38. Yang, A.; Bai, Y.; Liu, H.; Jin, K.; Xue, T.; Ma, W. Application of SVM and its Improved Model in Image Segmentation. Mob. Netw. Appl. 2022, 27, 851–861. [Google Scholar] [CrossRef]
  39. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 3431–3440. [Google Scholar]
  40. Li, X.H.; Yin, F.; Xue, T.; Liu, L.; Ogier, J.M.; Liu, C.L. Instance aware document image segmentation using label pyramid networks and deep watershed transformation. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 514–519. [Google Scholar]
  41. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  42. Zhou, J.; Hao, M.; Zhang, D.; Zou, P.; Zhang, W. Fusion PSPnet image segmentation based method for multi-focus image fusion. IEEE Photonics J. 2019, 11, 6501412. [Google Scholar] [CrossRef]
  43. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  44. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  45. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  46. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  47. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  48. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Kuala Lumpur, Malaysia, 18–20 December 2021; pp. 9716–9725. [Google Scholar]
  49. Wu, Y.; Jiang, J.; Huang, Z.; Tian, Y. FPANet: Feature pyramid aggregation network for real-time semantic segmentation. Appl. Intell. 2022, 52, 3319–3336. [Google Scholar] [CrossRef]
  50. Tang, L.; Wan, L.; Wang, T.; Li, S. DECANet: Image Semantic Segmentation Method Based on Improved DeepLabv3+. Laser Optoelectron. Prog. 2023, 60, 92–100. [Google Scholar] [CrossRef]
  51. Datta, R.; Joshi, D.; Li, J.; Wang, J.Z. Studying aesthetics in photographic images using a computational approach. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 288–301. [Google Scholar]
  52. Liu, L.; Chen, R.; Wolf, L.; Cohen-Or, D. Optimizing photo composition. In Computer Graphics Forum; Blackwell Publishing Ltd.: Oxford, UK, 2010; Volume 29, pp. 469–478. [Google Scholar]
  53. Wong, L.K.; Low, K.L. Saliency-enhanced image aesthetics class prediction. In Proceedings of the 16th IEEE International Conference on Image Processing (ICIP), Cairo, Egypt, 7–10 November 2009; pp. 997–1000. [Google Scholar]
  54. Luo, Y.; Tang, X. Photo and video quality evaluation: Focusing on the subject. In Proceedings of the Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008; pp. 386–399. [Google Scholar]
  55. Datta, R.; Li, J.; Wang, J.Z. Algorithmic inferencing of aesthetics and emotion in natural images: An exposition. In Proceedings of the 15th IEEE International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; pp. 105–108. [Google Scholar]
  56. Lv, P.; Fan, J.; Nie, X.; Dong, W.; Jiang, X.; Zhou, B.; Xu, M.; Xu, C. User-guided personalized image aesthetic assessment based on deep reinforcement learning. IEEE Trans. Multimed. 2021, 25, 736–749. [Google Scholar] [CrossRef]
  57. Bhattacharya, S.; Sukthankar, R.; Shah, M. A framework for photo-quality assessment and enhancement based on visual aesthetics. In Proceedings of the 18th ACM International Conference on Multimedia, Florence, Italy, 25–29 October 2010; pp. 271–280. [Google Scholar]
  58. Tong, H.; Li, M.; Zhang, H.J.; He, J.; Zhang, C. Classification of digital photos taken by photographers or home users. In Proceedings of the Advances in Multimedia Information Processing-PCM 2004: 5th Pacific Rim Conference on Multimedia, Tokyo, Japan, 30 November–3 December 2004; pp. 198–205. [Google Scholar]
  59. Aydın, T.O.; Smolic, A.; Gross, M. Automated aesthetic analysis of photographic images. IEEE Trans. Vis. Comput. Graph. 2014, 21, 31–42. [Google Scholar] [CrossRef] [PubMed]
  60. Wang, L.; Wang, X.; Yamasaki, T.; Aizawa, K. Aspect-ratio-preserving multi-patch image aesthetics score prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  61. Wu, Y.; Bauckhage, C.; Thurau, C. The good, the bad, and the ugly: Predicting aesthetic image labels. In Proceedings of the 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 1586–1589. [Google Scholar]
  62. Bhattacharya, S.; Sukthankar, R.; Shah, M. A holistic approach to aesthetic enhancement of photographs. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2011, 7, 1–21. [Google Scholar] [CrossRef]
  63. Dhar, S.; Ordonez, V.; Berg, T.L. High level describable attributes for predicting aesthetics and interestingness. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1657–1664. [Google Scholar]
  64. Tang, X.; Luo, W.; Wang, X. Content-based photo quality assessment. IEEE Trans. Multimed. 2013, 15, 1930–1943. [Google Scholar] [CrossRef]
  65. Lo, K.Y.; Liu, K.H.; Chen, C.S. Assessment of photo aesthetics with efficiency. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba International Congress Center, Tsukuba Science City, Japan, 11–15 November 2012; pp. 2186–2189. [Google Scholar]
  66. Celona, L.; Leonardi, M.; Napoletano, P.; Rozza, A. Composition and style attributes guided image aesthetic assessment. IEEE Trans. Image Process. 2022, 31, 5009–5024. [Google Scholar] [CrossRef] [PubMed]
  67. Yeh, M.C.; Cheng, Y.C. Relative features for photo quality assessment. In Proceedings of the 19th IEEE International Conference on Image Processing, Orlando, FL, USA, 30 September–3 October 2012; pp. 2861–2864. [Google Scholar]
  68. Marchesotti, L.; Perronnin, F.; Larlus, D.; Csurka, G. Assessing the aesthetic quality of photographs using generic image descriptors. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1784–1791. [Google Scholar]
  69. Marchesotti, L.; Perronnin, F.; Meylan, F. Learning beautiful (and ugly) attributes. In Proceedings of the BMVC, London, UK, 6 September 2013; Volume 7, pp. 1–11. [Google Scholar]
  70. Murray, N.; Marchesotti, L.; Perronnin, F. AVA: A large-scale database for aesthetic visual analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2408–2415. [Google Scholar]
  71. Csurka, G.; Dance, C.; Fan, L.; Willamowski, J.; Bray, C. Visual categorization with bags of keypoints. In Proceedings of the Workshop on Statistical Learning in Computer Vision, Prague, Czech Republic, 11–14 May 2004; Volume 1, pp. 1–2. [Google Scholar]
  72. Wang, W.; Yi, J.; Xu, X.; Wang, L. Computational aesthetics of image classification and evaluation. J. Comput. Aided Des. Comput. Graph. 2014, 26, 1075–1083. [Google Scholar]
  73. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  74. Zhang, W.; Zhai, G.; Yang, X.; Yan, J. Hierarchical features fusion for image aesthetics assessment. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3771–3775. [Google Scholar]
  75. Li, X.; Li, X.; Zhang, G.; Zhang, X. A novel feature fusion method for computing image aesthetic quality. IEEE Access 2020, 8, 63043–63054. [Google Scholar] [CrossRef]
  76. Jang, H.; Lee, J.S. Analysis of deep features for image aesthetic assessment. IEEE Access 2021, 9, 29850–29861. [Google Scholar] [CrossRef]
  77. Li, L.; Zhu, H.; Zhao, S.; Ding, G.; Jiang, H.; Tan, A. Personality driven multi-task learning for image aesthetic assessment. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 430–435. [Google Scholar]
  78. Liu, J.; Lv, J.; Yuan, M.; Zhang, J.; Su, Y. ABSNet: Aesthetics-Based Saliency Network Using Multi-Task Convolutional Network. IEEE Signal Process. Lett. 2020, 27, 2014–2018. [Google Scholar] [CrossRef]
  79. Tian, X. Using multi-task residual network to evaluate image aesthetic quality. In Proceedings of the IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), Chongqing, China, 12–14 March 2021; Volume 5, pp. 171–174. [Google Scholar]
  80. Chen, Y.; Pu, Y.; Zhao, Z.; Xu, D.; Qian, W. Image Aesthetic Assessment Based on Emotion-Assisted Multi-Task Learning Network. In Proceedings of the 6th International Conference on Multimedia Systems and Signal Processing, Shenzhen, China, 22–24 May 2021; pp. 15–21. [Google Scholar]
  81. Wang, Y.; Li, Y.; Porikli, F. Finetuning convolutional neural networks for visual aesthetics. In Proceedings of the 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico, 4–8 December 2016; pp. 3554–3559. [Google Scholar]
  82. Wen, K.; Wei, Y.; Dong, X. Survey of application of deep convolution neural network in image aesthetic evaluation. Comput. Eng. Appl. 2019, 55, 13–23+58. [Google Scholar]
  83. Li, Y.; Pu, Y.; Xu, D.; Qian, W.; Wang, L. Image aesthetic quality evaluation using convolution neural network embedded fine-tune. In Proceedings of the CCF Chinese Conference on Computer Vision, Tianjin, China, 11–14 October 2017; pp. 269–283. [Google Scholar]
  84. Wang, W.; Zhao, M.; Wang, L.; Huang, J.; Cai, C.; Xu, X. A multi-scene deep learning model for image aesthetic evaluation. Signal Process. Image Commun. 2016, 47, 511–518. [Google Scholar] [CrossRef]
  85. Wang, Z.; Chang, S.; Dolcos, F.; Beck, D.; Liu, D.; Huang, T.S. Brain-inspired deep networks for image aesthetics assessment. arXiv 2016, arXiv:1601.04155. [Google Scholar]
  86. Lemarchand, F. Computational Modelling of Human Aesthetic Preferences in the Visual Domain: A Brain-Inspired Approach. Ph.D. Thesis, University of Plymouth, Plymouth, UK, 2018. [Google Scholar]
  87. Liu, Z.; Wang, Z.; Yao, Y.; Zhang, L.; Shao, L. Deep active learning with contaminated tags for image aesthetics assessment. IEEE Trans. Image Process. 2018. early access. [Google Scholar] [CrossRef] [PubMed]
  88. Xiang, X.; Cheng, Y.; Chen, J.; Lin, Q.; Allebach, J. Semi-supervised multi-task network for image aesthetic assessment. Electron. Imaging 2020, 32, 188-1–188-7. [Google Scholar] [CrossRef]
  89. Shu, Y.; Li, Q.; Liu, L.; Xu, G. Semi-supervised Adversarial Learning for Attribute-Aware Photo Aesthetic Assessment. IEEE Trans. Multimed. 2021. [Google Scholar] [CrossRef]
  90. Zhang, X.; Gao, X.; He, L.; Lu, W. MSCAN: Multimodal Self-and-Collaborative Attention Network for image aesthetic prediction tasks. Neurocomputing 2021, 430, 14–23. [Google Scholar] [CrossRef]
  91. Miao, H.; Zhang, Y.; Wang, D.; Feng, S. Multi-Output Learning Based on Multimodal GCN and Co-Attention for Image Aesthetics and Emotion Analysis. Mathematics 2021, 9, 1437. [Google Scholar] [CrossRef]
  92. Liu, X.; Jiang, Y. Aesthetic assessment of website design based on multimodal fusion. Future Gener. Comput. Syst. 2021, 117, 433–438. [Google Scholar] [CrossRef]
  93. Li, M.; Xu, Y.; Cui, L.; Huang, S.; Wei, F.; Li, Z.; Zhou, M. DocBank: A benchmark dataset for document layout analysis. arXiv 2020, arXiv:2006.01038. [Google Scholar]
  94. Mondal, A.; Lipps, P.; Jawahar, C.V. IIIT-AR-13K: A new dataset for graphical object detection in documents. In Proceedings of the Document Analysis Systems: 14th IAPR International Workshop, DAS 2020, Wuhan, China, 26–29 July 2020; pp. 216–230. [Google Scholar]
  95. Wang, Z.; Xu, Y.; Cui, L.; Shang, J.; Wei, F. Layoutreader: Pre-training of text and layout for reading order detection. arXiv 2021, arXiv:2108.11591. [Google Scholar]
  96. Abdallah, A.; Berendeyev, A.; Nuradin, I.; Nurseitov, D. Tncr: Table net detection and classification dataset. Neurocomputing 2022, 473, 79–97. [Google Scholar] [CrossRef]
  97. Zhu, W.; Sokhandan, N.; Yang, G.; Martin, S.; Sathyanarayana, S. DocBed: A multi-stage OCR solution for documents with complex layouts. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; Volume 36, pp. 12643–12649. [Google Scholar]
  98. Zhang, Z.; Yu, B.; Yu, H.; Liu, T.; Fu, C.; Li, J.; Tang, C.; Sun, J.; Li, Y. Layout-aware information extraction for document-grounded dialogue: Dataset, method and demonstration. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10 October 2022; pp. 7252–7260. [Google Scholar]
  99. Joshi, D.; Datta, R.; Fedorovskaya, E.; Luong, Q.T. Aesthetics and emotions in images. IEEE Signal Process. Mag. 2011, 28, 94–115. [Google Scholar] [CrossRef]
  100. Kong, S.; Shen, X.; Lin, Z.; Mech, R.; Fowlkes, C. Photo aesthetics ranking network with attributes and content adaptation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 662–679. [Google Scholar]
  101. Chang, K.Y.; Lu, K.H.; Chen, C.S. Aesthetic critiques generation for photos. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3514–3523. [Google Scholar]
  102. Schwarz, K.; Wieschollek, P.; Lensch, H.P.A. Will people like your image? Learning the aesthetic space. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 2048–2057. [Google Scholar]
  103. Wang, W.; Yang, S.; Zhang, W.; Zhang, J. Neural aesthetic image reviewer. IET Comput. Vis. 2019, 13, 749–758. [Google Scholar] [CrossRef]
  104. Jin, X.; Wu, L.; Zhao, G.; Li, X.; Zhang, X.; Ge, S.; Zou, D.; Zhou, B.; Zhou, X. Aesthetic attributes assessment of images. In Proceedings of the 27th ACM International Conference on Multimedia, Torino, Italy, 22–26 October 2018; pp. 311–319. [Google Scholar]
  105. Kang, C.; Valenzise, G.; Dufaux, F. Eva: An explainable visual aesthetics dataset. In Joint Workshop on Aesthetic and Technical Quality Assessment of Multimedia and Media Analytics for Societal Trends; Association for Computing Machinery: New York, NY, USA, 2020; pp. 5–13. [Google Scholar]
  106. Jin, X.; Wu, L.; Zhao, G.; Zhou, X.; Zhang, X.; Li, X. IDEA: A new dataset for image aesthetic scoring. Multimed. Tools Appl. 2020, 79, 14341–14355. [Google Scholar] [CrossRef]
  107. He, S.; Zhang, Y.; Xie, R.; Jiang, D.; Ming, A. Rethinking Image Aesthetics Assessment: Models, Datasets and Benchmarks. In Proceedings of the 31st International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022. [Google Scholar]
  108. Jin, X.; Li, X.; Lou, H.; Fan, C.; Deng, Q.; Xiao, C.; Cui, S.; Singh, A.K. Aesthetic attribute assessment of images numerically on mixed multi-attribute datasets. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 18, 1–16. [Google Scholar] [CrossRef]
Figure 1. Example of document layout analysis [18].
Figure 2. Example of poster layout analysis.
Figure 3. Multi-task learning model architecture [77].
Figure 4. Brain-inspired neural network framework [85].
Table 1. Comparative analysis of representative algorithms in traditional layout analysis methods.

| Category | Method | Advantage | Disadvantages | Reference |
|---|---|---|---|---|
| Top-down | Cyclic projection X-Y cut algorithm | Fast processing speed. | Performs poorly on complex and skewed layouts. | [22] |
| Top-down | Recursive dichotomous projection algorithm | Optimizes the conventional projection method. | Runs inefficiently and is time-consuming. | [22] |
| Bottom-up | Run-length smoothing algorithm (RLSA) | Simple algorithm with strong noise immunity. | Highly dependent on thresholds and computationally intensive. | [25] |
| Bottom-up | Connected-region algorithm | Quickly detects connected regions in an image. | Merging rules are hard to determine and require many parameters. | [24] |
| Bottom-up | Voronoi diagram algorithm | Good reliability and accuracy in electronic-document scenarios. | Does not support image-region splitting; fails on skewed layouts. | [26] |
| Bottom-up | Docstrum algorithm | Copes with different text sizes and fonts. | Relies on a set of clustering threshold parameters; does not support image-region splitting. | [16] |
| Hybrid | Texture analysis algorithms | Processes the page at both global and local scales, adapts to more complex text-image layouts, and is efficient to implement. | Texture block size is difficult to determine. | [32] |
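To make the bottom-up run-length smoothing idea summarized in Table 1 concrete, the following minimal NumPy sketch fills short background runs along rows and columns and intersects the two smoothed maps to form coarse blocks. The threshold values, function names, and toy page are illustrative assumptions, not parameters taken from [25].

```python
import numpy as np

def rlsa_1d(row, threshold):
    """Fill background gaps (0) shorter than `threshold` between
    foreground pixels (1) along one row or column."""
    out = row.copy()
    fg = np.flatnonzero(row)            # indices of foreground pixels
    for a, b in zip(fg[:-1], fg[1:]):
        if 0 < b - a - 1 <= threshold:  # short background run -> smooth it
            out[a + 1:b] = 1
    return out

def rlsa(binary_img, h_thresh=30, v_thresh=20):
    """Horizontal + vertical run-length smoothing; the AND of the two
    smoothed maps yields coarse text/graphic blocks."""
    horiz = np.apply_along_axis(rlsa_1d, 1, binary_img, h_thresh)
    vert = np.apply_along_axis(rlsa_1d, 0, binary_img, v_thresh)
    return horiz & vert

if __name__ == "__main__":
    # Toy binarized page: 1 = ink, 0 = background.
    page = np.zeros((8, 40), dtype=np.int64)
    page[2, 5:10] = 1
    page[2, 14:20] = 1   # the 4-pixel gap is bridged when h_thresh >= 4
    blocks = rlsa(page, h_thresh=6, v_thresh=2)
    print(blocks[2])
```

In practice the horizontal and vertical thresholds are tuned to the expected character and line spacing of the scanned page, which is exactly the threshold dependence listed as a disadvantage in Table 1.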
Table 2. Comparative analysis of SVM-based layout analysis methods.

| Method | Pub. Year | Advantage | Disadvantages |
|---|---|---|---|
| Wang et al. [34] | 2019 | Combines SVM with a mean-clustering algorithm to acquire training samples automatically. | Not applicable to the segmentation of complex layout images. |
| Zhou et al. [35] | 2019 | Sample points are selected manually after observing the color characteristics of the target and background regions. | Manual selection of sample points is time-consuming and labor-intensive. |
| Lu et al. [36] | 2020 | Combines phase congruency and texture features into new feature vectors for layout segmentation. | Boundary accuracy between graphics and images is low owing to poor graphic regularity and high ambiguity. |
| Wu et al. [37] | 2022 | Classifies image pixels by labelling foreground and background samples in the image. | Manual selection of sample points is time-consuming and labor-intensive. |
| Yang et al. [38] | 2022 | Improves the SVM algorithm by adding the hue-saturation-intensity (HSI) color space and using the RGB and HSI dual color-space channels as feature vectors to classify pixels. | The selected kernel functions and parameters are only applicable to a small number of image-segmentation tasks. |
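The methods in Table 2 all reduce layout segmentation to pixel classification over hand-picked samples. A hedged scikit-learn sketch of that pipeline is given below; the RGB + HSI feature combination mirrors the idea in [38], but the helper names (`rgb_to_hsi`, `segment_with_svm`), the kernel choice, and the hyperparameters are illustrative assumptions rather than the published settings.

```python
import numpy as np
from sklearn.svm import SVC

def rgb_to_hsi(rgb):
    """Convert an (N, 3) array of RGB values in [0, 1] to HSI features."""
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    i = (r + g + b) / 3.0
    s = 1.0 - np.min(rgb, axis=1) / np.maximum(i, 1e-8)
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + 1e-8
    theta = np.arccos(np.clip(num / den, -1.0, 1.0))
    h = np.where(b <= g, theta, 2 * np.pi - theta) / (2 * np.pi)
    return np.stack([h, s, i], axis=1)

def segment_with_svm(image, fg_samples, bg_samples):
    """Train an RBF-kernel SVM on manually picked foreground/background
    RGB samples (RGB + HSI features) and label every pixel of the image."""
    X = np.vstack([fg_samples, bg_samples]).astype(float) / 255.0
    X = np.hstack([X, rgb_to_hsi(X)])
    y = np.r_[np.ones(len(fg_samples)), np.zeros(len(bg_samples))]
    clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X, y)

    pixels = image.reshape(-1, 3).astype(float) / 255.0
    feats = np.hstack([pixels, rgb_to_hsi(pixels)])
    return clf.predict(feats).reshape(image.shape[:2])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical hand-picked samples: dark ink pixels vs. light paper pixels.
    fg = rng.integers(0, 60, size=(50, 3))
    bg = rng.integers(200, 256, size=(50, 3))
    img = rng.integers(0, 256, size=(32, 32, 3))
    mask = segment_with_svm(img, fg, bg)
    print(mask.shape, int(mask.sum()), "foreground pixels")
```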
Table 3. Comparison and analysis of neural network-based layout analysis methods.

| Method | Pub. Year | Backbone | Dataset(s) | MIoU (%) | Major Contributions |
|---|---|---|---|---|---|
| DeepLabv1 [43] | 2014 | VGG-16 | Pascal VOC 2012 | 71.6 | Atrous convolution, fully connected CRFs |
| FCN [39] | 2015 | VGG-16 | Pascal VOC 2011 | 62.7 | Pioneer of end-to-end semantic segmentation |
| PSPNet [41] | 2017 | VGG-16/ResNet101 | Pascal VOC 2012 / Cityscapes | 85.4 / 80.2 | Spatial pyramid pooling module |
| DeepLabv2 [44] | 2017 | ResNet50 | Pascal VOC 2012 / Cityscapes | 79.7 / 70.4 | Proposed atrous spatial pyramid pooling (ASPP) |
| DeepLabv3 [45] | 2017 | ResNet101 | Pascal VOC 2012 / Cityscapes | 86.9 / 81.3 | Cascaded or parallel ASPP modules |
| DeepLabv3+ [46] | 2018 | Xception | Pascal VOC 2012 / Cityscapes | 89.0 / 82.1 | Added an upsampled decoder module |
| DANet [47] | 2019 | ResNet101 | Pascal VOC 2012 | 82.6 | Dual attention: position attention module and channel attention module |
| STDC [48] | 2021 | STDC2 | ImageNet / Cityscapes / CamVid | 76.4 / 77.0 / 73.9 | Proposed a detail-aggregation module to learn the decoder |
| FPANet [49] | 2022 | ResNet18 | Cityscapes | 75.9 | Uses ResNet and atrous spatial pyramid pooling (ASPP) to extract higher-level semantic information |
| DECANet [50] | 2023 | ResNet101 | Pascal VOC 2012 / Cityscapes | 81.0 / 76.0 | Introduces an efficient channel attention network (ECANet) in the encoder |
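The MIoU values reported in Table 3 are the standard mean intersection-over-union computed from a per-class confusion matrix over all pixels. A small sketch of that computation, not tied to any specific model in the table, is given below with a worked toy example.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union from per-class confusion counts.
    `pred` and `gt` are integer label maps of identical shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    # conf[i, j] counts pixels with ground-truth class i predicted as class j.
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)
    inter = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = inter / np.maximum(union, 1)          # avoid division by zero
    return iou[union > 0].mean(), iou           # ignore classes absent everywhere

if __name__ == "__main__":
    gt = np.array([[0, 0, 1], [1, 2, 2]])
    pred = np.array([[0, 1, 1], [1, 2, 0]])
    miou, per_class = mean_iou(pred, gt, num_classes=3)
    print(per_class, miou)  # per-class IoU = [0.33, 0.33, 0.33], mIoU ≈ 0.33
```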
Table 4. Comparison of layout-analysis datasets.

| Dataset | Year | Total Images | Layout Categories | Description | Reference Link |
|---|---|---|---|---|---|
| PubLayNet [18] | 2019 | 360,000 | 5 | Comprises 33 detailed categories (e.g., tables, images, paragraphs) and 2 base classes (text and non-text objects); layouts are annotated with bounding boxes and polygon segments. | https://github.com/ibm-aur-nlp/PubLayNet (accessed on 9 June 2023) |
| DocBank [93] | 2020 | 500,000 | 12 | A document-level benchmark with fine-grained token-level annotations for layout analysis; its 500,000 document pages cover 12 types of semantic units. | https://github.com/doc-analysis/DocBank (accessed on 9 June 2023) |
| IIIT-AR-13K [94] | 2021 | 13,000 | 5 | The largest manually annotated dataset for graphical object detection, with five categories: tables, figures, natural images, logos, and signatures. | http://cvit.iiit.ac.in/usodi/iiitar13k.php (accessed on 9 June 2023) |
| ReadingBank [95] | 2021 | 500,000 | – | A benchmark for reading-order detection containing 500K document images of various types with corresponding reading-order information. | https://github.com/microsoft/unilm/tree/master/layoutreader (accessed on 9 June 2023) |
| TNCR [96] | 2021 | 9428 | 5 | Serves as a baseline for table detection, structure recognition, and table classification, covering five different table classes. | https://github.com/abdoelsayed2016/TNCR_Dataset (accessed on 9 June 2023) |
| NewsNet7 [97] | 2022 | 3000 | 7 | Contains 3000 fully annotated real newspaper images; mainly used for layout analysis of documents with complex layouts. | Not yet public |
| LIE [98] | 2022 | 4061 | – | Constructed from 400 documents containing 4061 fully annotated pages; mainly used for analysis of multiple layout formats. | https://github.com/jsvine/pdfplumber (accessed on 9 June 2023) |
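Several of the datasets in Table 4 (e.g., PubLayNet) distribute their layout annotations as COCO-style JSON. The sketch below shows one way to group the annotated region boxes by page; the file path is a placeholder, and the field names assume the standard COCO schema rather than any dataset-specific extension.

```python
import json
from collections import Counter

def load_coco_layout(annotation_path):
    """Read a COCO-style layout annotation file and group the
    bounding boxes of layout regions by page image."""
    with open(annotation_path, "r", encoding="utf-8") as f:
        coco = json.load(f)

    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    boxes_per_image = {}
    for ann in coco["annotations"]:
        label = id_to_name[ann["category_id"]]
        # COCO boxes are [x, y, width, height] in pixels.
        boxes_per_image.setdefault(ann["image_id"], []).append((label, ann["bbox"]))
    return boxes_per_image, id_to_name

if __name__ == "__main__":
    # Placeholder path; substitute the downloaded annotation file.
    boxes, names = load_coco_layout("publaynet/val.json")
    counts = Counter(label for anns in boxes.values() for label, _ in anns)
    print(len(boxes), "pages;", counts)   # region counts per layout category
```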
Table 5. Comparison of image aesthetic-quality-evaluation datasets.

| Dataset | Year | Total Images | Score Range | Description |
|---|---|---|---|---|
| Photo.net [99] | 2011 | 20,278 | 0–7 | Each image has been rated by at least 10 people on a scale of 0 to 7, with 7 denoting the most aesthetically pleasing image. |
| AVA [70] | 2012 | 255,530 | 1–10 | Each image was rated by 78 to 549 raters on a scale of 1 to 10, with the mean score used as the ground-truth label. Each image carries one or two semantic tags, drawn from 66 textual tags across the dataset. |
| AADB [100] | 2016 | 10,000 | 1–5 | Images are scored by five raters on a scale of 1 to 5; each image has an overall score and 11 aesthetic-attribute scores. |
| PCCD [101] | 2017 | 4235 | 1–10 | More comprehensively labeled: contains scores, score distributions, and multi-rater verbal comments for one overall and six aesthetic factors, each rated from 1 to 10 and finally normalized to [0, 1]. |
| AROD [102] | 2018 | 380,000 | – | The aesthetic score of each image is computed from the number of views and comments it received on Flickr. |
| AVA-Reviews [103] | 2018 | 40,000 | – | Each image is accompanied by six linguistic comments, labeled without distinguishing multiple aesthetic factors. |
| DPC-Captions [104] | 2019 | 154,384 | – | Contains annotations for up to five aesthetic attributes per image, obtained through knowledge transfer from the fully annotated small-scale PCCD dataset. |
| EVA [105] | 2020 | 4070 | – | At least 30 votes per image, covering the difficulty of the aesthetic rating, ratings of four complementary aesthetic attributes, and the relative importance of each attribute in forming the aesthetic opinion. |
| IDEA [106] | 2020 | 9191 | 0–9 | Nearly balanced distribution: 1000 images for each score from 0 to 8 and 191 images with a score of 9. |
| TAD66K [107] | 2022 | 66,327 | 1–10 | Covers 47 popular themes; each image is densely annotated by more than 1200 people using theme-specific evaluation criteria. |
| AMD-A [108] | 2023 | 16,924 | 0–1 | Divided into two groups: one (11,166 images) for overall aesthetic-score regression and the other (16,924 images) for classification and regression of aesthetic-attribute scores. |
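Most datasets in Table 5 ultimately reduce rater votes to a mean opinion score, and AVA-style benchmarks additionally binarize that score (a threshold of 5 is common) for high/low aesthetic classification. The sketch below illustrates that aggregation on a hypothetical vote histogram; the histogram values and the function name are illustrative assumptions.

```python
import numpy as np

def aggregate_votes(vote_histogram, threshold=5.0):
    """Turn a vote histogram over scores 1..10 into the mean opinion
    score, a binary high/low aesthetic label, and the normalized
    score distribution often used as a regression target."""
    votes = np.asarray(vote_histogram, dtype=float)
    scores = np.arange(1, len(votes) + 1)
    mean_score = (votes * scores).sum() / votes.sum()
    distribution = votes / votes.sum()
    return mean_score, int(mean_score > threshold), distribution

if __name__ == "__main__":
    # Hypothetical histogram: number of raters giving scores 1..10.
    hist = [0, 2, 5, 18, 40, 52, 30, 9, 2, 0]
    mean, label, dist = aggregate_votes(hist)
    print(f"mean={mean:.2f}, high-quality={bool(label)}")  # mean ≈ 5.72 -> high
```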