Computer Vision and Pattern Recognition in the Era of Deep Learning

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (31 December 2019) | Viewed by 92639

Special Issue Editor


Prof. Athanasios Nikolaidis
Guest Editor
Department of Informatics, Computer and Telecommunications Engineering, International Hellenic University, Terma Magnesias Str., 62124 Serres, Greece
Interests: image processing; computer vision; computer graphics; pattern recognition; virtual and augmented reality; multimedia systems and applications

Special Issue Information

Dear Colleagues,

Deep learning has become a highly popular trend in the machine learning community in recent years, although the term was coined several decades ago. The idea behind deep learning is to imitate the function of the human brain by constructing an artificial neural network with multiple hidden layers, in order to learn better features than a conventional shallow network. More precisely, deep learning introduces a hierarchical learning architecture that resembles the layered learning process that takes place in the primary sensory areas of the neocortex in the human brain. It has been shown that, beyond a certain dataset size, the performance of deep networks increases at a much higher rate than that of shallow networks. This has enabled the practical use of deep neural networks in recent years, since a vast amount of unlabeled multimedia information is now available and the processing capability of modern computers has risen immensely.

Deep learning and related neural networks, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have already been exploited in a great variety of applications, such as automatic text translation, spoken language recognition, music composition, autonomous vehicles, robotics, medical diagnosis, and stock market prediction.

An especially popular field of deep learning applications has been that of computer vision and pattern recognition. Typical examples of areas where deep networks have been used are object detection, face detection and recognition, optical character recognition, and image classification. In this Special Issue, we welcome contributions from scholars in all related subjects, presenting either a deep learning solution to a novel application, or a deep learning enhancement to a preexisting application.

Prof. Athanasios Nikolaidis
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Color restoration
  • Face detection
  • Pose estimation
  • Sentiment recognition
  • Behavior analysis
  • Text image translation
  • Automated lip reading
  • Image synthesis
  • Image classification
  • Handwriting recognition
  • Object detection
  • Object classification

Published Papers (14 papers)


Research

13 pages, 859 KiB  
Article
Amharic OCR: An End-to-End Learning
by Birhanu Belay, Tewodros Habtegebrial, Million Meshesha, Marcus Liwicki, Gebeyehu Belay and Didier Stricker
Appl. Sci. 2020, 10(3), 1117; https://doi.org/10.3390/app10031117 - 7 Feb 2020
Cited by 14 | Viewed by 9662
Abstract
In this paper, we introduce an end-to-end Amharic text-line image recognition approach based on recurrent neural networks. Amharic is an indigenous Ethiopic script that follows a unique syllabic writing system adopted from the ancient Geez script. The script uses 34 consonant characters with seven vowel variants of each (called basic characters), plus other labialized characters derived by adding diacritical marks to, and/or removing parts of, the basic characters. These diacritics are relatively small, visually similar, and challenging to distinguish from the derived characters. Motivated by the recent success of end-to-end learning in pattern recognition, we propose a model that integrates a feature extractor, sequence learner, and transcriber in a unified module trained in an end-to-end fashion. Experimental results on ADOCR, a benchmark Amharic Optical Character Recognition (OCR) database of printed and synthetic text, demonstrate that the proposed model outperforms state-of-the-art methods by 6.98% and 1.05%, respectively.
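The pipeline described, a convolutional feature extractor feeding a recurrent sequence learner and a CTC-style transcriber, can be sketched compactly in PyTorch. The sizes below (input height, layer widths, and the class count standing in for the Amharic symbol inventory plus the CTC blank) are illustrative assumptions, not the authors' configuration.

import torch
import torch.nn as nn

class TextLineRecognizer(nn.Module):
    def __init__(self, num_classes=280):  # placeholder symbol count + blank
        super().__init__()
        # Feature extractor: shrinks the image; width becomes the time axis
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Sequence learner over the horizontal axis
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True, batch_first=True)
        # Transcriber: per-timestep class scores, decoded with CTC
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                        # x: (B, 1, 32, W) text line
        f = self.cnn(x)                          # (B, 128, 8, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)     # (B, W/4, 128*8)
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)      # CTC expects log-probs

model = TextLineRecognizer()
log_probs = model(torch.randn(2, 1, 32, 256))    # (2, 64, 280)
ctc_loss = nn.CTCLoss(blank=0)                   # end-to-end training criterion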

11 pages, 3808 KiB  
Article
Object Detection with Low Capacity GPU Systems Using Improved Faster R-CNN
by Atakan Körez and Necaattin Barışçı
Appl. Sci. 2020, 10(1), 83; https://doi.org/10.3390/app10010083 - 20 Dec 2019
Cited by 13 | Viewed by 3937
Abstract
Object detection in remote sensing images is used in a wide range of areas, such as land planning, city monitoring, traffic monitoring, and agriculture. It is essential in aerial and satellite image analysis, but it remains a challenge. Many object detection models based on convolutional neural networks (CNNs) have been proposed to address it, and the deformable convolutional structure has been introduced to overcome the fixed-grid sampling of standard convolutions. In this study, a multi-scale Faster R-CNN method based on deformable convolution is proposed for single/low-capacity graphics processing unit (GPU) systems. Weight standardization (WS) is used instead of batch normalization (BN) to make the proposed model more efficient for a small batch size (one image per GPU) on single-GPU systems. Experiments were conducted on the publicly available 10-class geospatial object detection (NWPU VHR-10) dataset to evaluate the object detection performance of the proposed model. The results show that our model achieved 92.3% mAP, a 1.7% increase over the best previously reported result on the same dataset.
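Weight standardization is a compact, self-contained substitute for batch normalization: the convolution weights themselves are standardized per output channel on every forward pass, so no batch statistics are required. A minimal PyTorch sketch, assuming a small epsilon and arbitrary layer sizes (not the authors' code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d whose weights are standardized per output channel before use."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

conv = WSConv2d(3, 16, kernel_size=3, padding=1)
y = conv(torch.randn(1, 3, 64, 64))   # behaves sensibly even at batch size 1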

18 pages, 2097 KiB  
Article
A New Multi-Scale Convolutional Model Based on Multiple Attention for Image Classification
by Yadong Yang, Chengji Xu, Feng Dong and Xiaofeng Wang
Appl. Sci. 2020, 10(1), 101; https://doi.org/10.3390/app10010101 - 20 Dec 2019
Cited by 14 | Viewed by 3474
Abstract
Computer vision systems should be insensitive to the scale of objects in natural scenes, so it is important to study the multi-scale representation of features. Res2Net implements hierarchical multi-scale convolution in residual blocks, but its random grouping method affects the robustness and intuitive interpretability of the network. We propose a new multi-scale convolution model based on multiple attention, which introduces the attention mechanism into the structure of a Res2-block to better guide feature expression. First, we adopt channel attention to score channels and sort them in descending order of feature importance (Channels-Sort). The sorted residual blocks are grouped and hierarchically convolved within each block to form a single-attention multi-scale block (AMS-block). Then, we apply channel attention to the residual sub-blocks to constitute a dual-attention multi-scale block (DAMS-block). Finally, spatial attention is introduced before the channels are sorted, forming a multi-attention multi-scale block (MAMS-block). A MAMS convolutional neural network (CNN) is a series of MAMS-blocks; it enables significant information to be expressed at more levels and can easily be grafted into different convolutional structures. Owing to hardware limitations, we validate the proposed ideas only on convolutional networks of the same magnitude. The experimental results show that a convolution model with an attention mechanism and multi-scale features is superior for image classification.
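The Channels-Sort step can be pictured as SE-style channel scoring followed by reordering the channels by importance, after which the grouped hierarchical convolution operates on the sorted groups. The sketch below illustrates only the scoring-and-sorting step, with assumed sizes; the paper's exact module layout may differ.

import torch
import torch.nn as nn

class ChannelsSort(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.score = nn.Sequential(          # SE-style channel scoring
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, C, H, W)
        s = self.score(x)                    # (B, C) importance scores
        order = s.argsort(dim=1, descending=True)
        idx = order[..., None, None].expand_as(x)
        # Scale channels by their scores, then reorder by importance
        return torch.gather(x * s[..., None, None], 1, idx)

x = torch.randn(2, 64, 32, 32)
sorted_x = ChannelsSort(64)(x)   # grouped hierarchical convolution follows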

13 pages, 14128 KiB  
Article
STN-Homography: Direct Estimation of Homography Parameters for Image Pairs
by Qiang Zhou and Xin Li
Appl. Sci. 2019, 9(23), 5187; https://doi.org/10.3390/app9235187 - 29 Nov 2019
Cited by 16 | Viewed by 3473
Abstract
Estimating a 2D homography from a pair of images is a fundamental task in computer vision. Contrary to most convolutional neural network-based homography estimation methods, which use four-point homography parameterization schemes, in this study we directly estimate the 3 × 3 homography matrix. We show that, after coordinate normalization, the magnitude difference and variance of the elements of the normalized 3 × 3 homography matrix are very small. Accordingly, we present STN-Homography, a neural network based on the spatial transformer network (STN), to directly estimate the normalized homography matrix of an image pair. To decrease the estimation error, we propose hierarchical STN-Homography and sequence STN-Homography models, of which the sequence model can be trained in an end-to-end manner. The effectiveness of the proposed methods is demonstrated in experiments on the Microsoft common objects in context (MSCOCO) dataset, where they significantly outperform the current state of the art. The average processing times of the three-stage hierarchical STN-Homography and the three-stage sequence STN-Homography models on a GPU are 17.85 ms and 13.85 ms, respectively; both satisfy the real-time requirements of most potential applications.
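Because coordinate normalization keeps the matrix entries at comparable magnitudes, the regression target can simply be the nine entries of the normalized homography. A minimal sketch of such a regression head, with an assumed backbone feature size:

import torch
import torch.nn as nn

class HomographyHead(nn.Module):
    """Regresses the nine entries of the normalized 3x3 homography."""
    def __init__(self, in_features=512):   # assumed backbone feature size
        super().__init__()
        self.fc = nn.Linear(in_features, 9)

    def forward(self, feat):                # feat: (B, in_features)
        return self.fc(feat).view(-1, 3, 3)

H = HomographyHead()(torch.randn(4, 512))   # (4, 3, 3) normalized homographies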

18 pages, 7917 KiB  
Article
A Vision-Based Method Utilizing Deep Convolutional Neural Networks for Fruit Variety Classification in Uncertainty Conditions of Retail Sales
by Rudnik Katarzyna and Michalski Paweł
Appl. Sci. 2019, 9(19), 3971; https://doi.org/10.3390/app9193971 - 22 Sep 2019
Cited by 32 | Viewed by 4284
Abstract
This study proposes a double-track method for the classification of fruit varieties for application in retail sales. The method uses two nine-layer Convolutional Neural Networks (CNNs) with the same architecture but different weight matrices. The first network classifies fruits from images that include the background, and the second from ROI (Region of Interest) images containing a single fruit. The results are aggregated with proposed importance weights, and the method returns the predicted class membership together with a Certainty Factor (CF). The use of a certainty factor associated with predictions from the original images and cropped ROIs is the main contribution of this paper. It is shown that CFs indicate the correctness of the classification result and represent a more reliable measure than the probabilities at the CNN outputs. The method is tested on a dataset containing images of six apple varieties. The overall image classification accuracy on this testing dataset is excellent (99.78%). In conclusion, the proposed method is highly successful at recognizing unambiguous, ambiguous, and uncertain classifications, and it can be used in vision-based sales systems under uncertain conditions and in unplanned situations.
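The double-track aggregation reduces to a weighted combination of the two networks' class probabilities, with the top combined score read off as the certainty factor. A minimal sketch; the weight values here are illustrative, not the ones proposed in the paper.

import numpy as np

def aggregate(p_full, p_roi, w_full=0.4, w_roi=0.6):
    """Combine full-image and ROI class probabilities; weights are assumed."""
    combined = w_full * np.asarray(p_full) + w_roi * np.asarray(p_roi)
    cls = int(np.argmax(combined))
    cf = float(combined[cls])      # certainty factor of the predicted class
    return cls, cf

cls, cf = aggregate([0.7, 0.2, 0.1], [0.5, 0.4, 0.1])
print(cls, round(cf, 2))           # 0 0.58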

11 pages, 3089 KiB  
Article
Vision-Based Classification of Mosquito Species: Comparison of Conventional and Deep Learning Methods
by Kazushige Okayasu, Kota Yoshida, Masataka Fuchida and Akio Nakamura
Appl. Sci. 2019, 9(18), 3935; https://doi.org/10.3390/app9183935 - 19 Sep 2019
Cited by 40 | Viewed by 6301
Abstract
This study proposes a vision-based method to classify mosquito species. To investigate its efficiency, we compared two classification approaches: a conventional method based on handcrafted features and a deep learning method based on convolutional neural networks. For the conventional method, 12 types of handcrafted features were extracted and a support vector machine was used for classification. For the deep learning method, three types of architectures were adopted. We built a mosquito image dataset of 14,400 images covering three mosquito species, comprising 12,000 images for training, 1500 for testing, and 900 for validation. Experimental results revealed that the accuracy of the conventional method, using the scale-invariant feature transform algorithm, was at most 82.4%, whereas the deep learning method reached 95.5% with a residual network and data augmentation. These results indicate that deep learning is effective for classifying the mosquito species in the proposed dataset and that data augmentation improves classification accuracy.
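As a concrete illustration of the deep-learning track, a residual network can be fine-tuned for the three species with standard augmentation. A minimal torchvision sketch; the paper's exact architectures and transforms are assumptions here:

import torch.nn as nn
from torchvision import models, transforms

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),       # simple augmentation, which the
    transforms.RandomRotation(15),           # paper found to improve accuracy
    transforms.ToTensor(),
])
net = models.resnet18(weights=None)          # pretrained=False on old versions
net.fc = nn.Linear(net.fc.in_features, 3)    # three mosquito species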

22 pages, 11093 KiB  
Article
Motion Capture Research: 3D Human Pose Recovery Based on RGB Video Sequences
by Xin Min, Shouqian Sun, Honglie Wang, Xurui Zhang, Chao Li and Xianfu Zhang
Appl. Sci. 2019, 9(17), 3613; https://doi.org/10.3390/app9173613 - 2 Sep 2019
Cited by 2 | Viewed by 4987
Abstract
Using video sequences to restore 3D human poses is of great significance in the field of motion capture. This paper proposes a novel approach that estimates 3D human pose via end-to-end learning of a deep convolutional neural network that regresses the parameters of the skinned multi-person linear (SMPL) model. The method has two main stages. (1) 3D human pose estimation from a single frame: we use 2D/3D skeleton-point constraints, human height constraints, and generative adversarial network constraints to obtain a more accurate human-body model, pre-trained on open-source human pose datasets. (2) Human-body pose generation from video streams: exploiting the correlation between video frames, we propose a 3D human pose recovery method that produces a smoother 3D pose. In addition, we compared the proposed method with a commercial motion capture platform to prove its effectiveness. For comparison, we first built a motion capture platform from two Kinect (V2) devices and iPi Soft series software to obtain depth-camera and monocular-camera video sequences, respectively. We then defined several tasks varying the speed of the movements, the position and orientation of the subject, and the complexity of the movements. Experimental results show that our low-cost method based on RGB video data achieves results similar to those of the commercial platform based on RGB-D video data.
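The constraint set in stage (1) amounts to a weighted multi-term training loss. A schematic sketch with illustrative weights; tensor shapes and the discriminator are assumptions, not the authors' implementation:

import torch
import torch.nn.functional as F

def pose_loss(j2d_p, j2d_g, j3d_p, j3d_g, h_p, h_g, d_fake,
              w=(1.0, 1.0, 0.1, 0.01)):       # illustrative term weights
    l2d = F.mse_loss(j2d_p, j2d_g)            # 2D skeleton-point constraint
    l3d = F.mse_loss(j3d_p, j3d_g)            # 3D skeleton-point constraint
    lh = F.l1_loss(h_p, h_g)                  # human height constraint
    ladv = F.binary_cross_entropy_with_logits(
        d_fake, torch.ones_like(d_fake))      # GAN constraint: fool the critic
    return w[0] * l2d + w[1] * l3d + w[2] * lh + w[3] * ladv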

12 pages, 2646 KiB  
Article
1D Barcode Detection via Integrated Deep-Learning and Geometric Approach
by Yunzhe Xiao and Zhong Ming
Appl. Sci. 2019, 9(16), 3268; https://doi.org/10.3390/app9163268 - 9 Aug 2019
Cited by 16 | Viewed by 9932
Abstract
Vision-based 1D barcode reading has been the subject of extensive research in recent years due to the high demand for automation in various industrial settings. Existing approaches to detecting the image region of 1D barcodes are slow, imprecise, or both: deep-learning-based methods can locate the barcode region quickly but lack an adequate, accurate segmentation process, while simple geometric techniques localize weakly and incur unnecessary computational cost on high-resolution images. We propose integrating the deep-learning and geometric approaches to achieve robust barcode localization against complicated backgrounds while accurately detecting the barcode within the localized region. Our integrated real-time solution combines the advantages of both methods, and no parameters need to be tuned manually. Through extensive experimentation on standard benchmarks, we show that our integrated approach outperforms state-of-the-art methods by at least 5%.
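The geometric stage can be pictured as gradient analysis inside the region the detector proposes: 1D barcodes have strong gradients in one direction only. The OpenCV sketch below shows one common gradient-plus-morphology recipe for that refinement, offered as an illustration rather than the paper's exact algorithm:

import cv2
import numpy as np

def refine_barcode(roi_gray):
    """Return the tight rotated rectangle of the bar pattern in a gray ROI."""
    gx = cv2.Sobel(roi_gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(roi_gray, cv2.CV_32F, 0, 1)
    grad = cv2.convertScaleAbs(cv2.subtract(np.abs(gx), np.abs(gy)))
    _, bw = cv2.threshold(grad, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    bw = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, np.ones((7, 21), np.uint8))
    cnts, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return cv2.minAreaRect(max(cnts, key=cv2.contourArea)) if cnts else None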

14 pages, 4567 KiB  
Article
Periodic Surface Defect Detection in Steel Plates Based on Deep Learning
by Yang Liu, Ke Xu and Jinwu Xu
Appl. Sci. 2019, 9(15), 3127; https://doi.org/10.3390/app9153127 - 1 Aug 2019
Cited by 37 | Viewed by 6140
Abstract
Roll marks on hot-rolled steel plates are difficult to detect because they have low contrast in images. A periodic defect detection method based on a convolutional neural network (CNN) and long short-term memory (LSTM) is proposed to detect periodic defects such as roll marks, exploiting the strong time-sequence characteristics of such defects. First, features of the defect image are extracted by a CNN, and the extracted feature vectors are then fed into an LSTM for defect recognition. Experiments show a detection rate of 81.9%, which is 10.2% higher than that of a CNN-only method. To make more accurate use of earlier information, the method is further improved with an attention mechanism, which quantifies the importance of the information input at each previous moment and weights it accordingly. With this improvement, the detection rate increases to 86.2%.
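The CNN-then-LSTM arrangement treats consecutive defect images as a sequence: per-frame CNN features feed an LSTM whose final state yields the decision. A minimal sketch with illustrative layer sizes (the attention-augmented variant is omitted):

import torch
import torch.nn as nn

class PeriodicDefectNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(             # per-frame feature extractor
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, seq):                   # seq: (B, T, 1, H, W)
        b, t = seq.shape[:2]
        feats = self.cnn(seq.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.fc(out[:, -1])            # classify from the last step

scores = PeriodicDefectNet()(torch.randn(2, 8, 1, 64, 64))   # (2, 2)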

14 pages, 3186 KiB  
Article
An Integrated Wildlife Recognition Model Based on Multi-Branch Aggregation and Squeeze-And-Excitation Network
by Jiangjian Xie, Anqi Li, Junguo Zhang and Zhean Cheng
Appl. Sci. 2019, 9(14), 2794; https://doi.org/10.3390/app9142794 - 12 Jul 2019
Cited by 12 | Viewed by 3133
Abstract
Infrared camera trapping, which captures large volumes of wildlife images, is a widely used, non-intrusive monitoring method in wildlife surveillance, and automatic image identification can greatly reduce the workload of zoologists. To achieve higher accuracy in wildlife recognition, an integrated model based on multi-branch aggregation and a Squeeze-and-Excitation network is introduced. The model adopts multi-branch aggregation transformation to extract features and uses Squeeze-and-Excitation blocks to adaptively recalibrate channel-wise feature responses based on explicitly modeled interdependencies between channels. Its efficacy is tested on two datasets: the Snapshot Serengeti dataset and our own dataset. On Snapshot Serengeti, the integrated model recognizes 26 wildlife species, with top accuracies of 95.3% in Top-1 (the correct class is the most probable class) and 98.8% in Top-5 (the correct class is among the five most probable classes). On our own dataset, compared with the ROI-CNN algorithm and ResNet (Deep Residual Network), the integrated model shows a maximum improvement of 4.4% in recognition accuracy.
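The Squeeze-and-Excitation block used for channel recalibration is a standard unit; a minimal PyTorch rendering is below (the usual reduction ratio of 16 is assumed, not necessarily the paper's setting):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze
        self.fc = nn.Sequential(                       # excitation
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # recalibrate channels

y = SEBlock(64)(torch.randn(2, 64, 28, 28))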

11 pages, 6998 KiB  
Article
CP-SSD: Context Information Scene Perception Object Detection Based on SSD
by Yun Jiang, Tingting Peng and Ning Tan
Appl. Sci. 2019, 9(14), 2785; https://doi.org/10.3390/app9142785 - 11 Jul 2019
Cited by 4 | Viewed by 3434
Abstract
The Single Shot MultiBox Detector (SSD) has achieved good results in object detection, but it suffers from an insufficient understanding of context information and a loss of features in deep layers. To alleviate these problems, we propose a single-shot object detection network, Context Perception-SSD (CP-SSD). CP-SSD improves the network's understanding of a scene through a context-information scene perception module that captures context for objects of different scales. A semantic activation module is applied to the deep feature maps; through self-supervised learning, it adjusts contextual feature information and channel interdependence to enhance useful semantic information. CP-SSD was validated on the PASCAL VOC 2007 benchmark. The experimental results show that the mean Average Precision (mAP) of CP-SSD reaches 77.8%, 0.6% higher than SSD, with a notably improved detection effect on images in which the object is difficult to distinguish from the background.
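The abstract does not spell out the internals of the context perception module. One common way to gather context at several scales, shown here purely as an illustrative stand-in, is a set of parallel dilated convolutions whose outputs are fused:

import torch
import torch.nn as nn

class ContextModule(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 4)                # growing receptive fields
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

y = ContextModule(256)(torch.randn(1, 256, 38, 38))   # a typical SSD map size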

13 pages, 3343 KiB  
Article
Classification Method of Plug Seedlings Based on Transfer Learning
by Zhang Xiao, Yu Tan, Xingxing Liu and Shenghui Yang
Appl. Sci. 2019, 9(13), 2725; https://doi.org/10.3390/app9132725 - 5 Jul 2019
Cited by 8 | Viewed by 2573
Abstract
The classification of plug seedlings is important in the replanting process. This paper proposes a classification method for plug seedlings based on transfer learning. First, the region of interest of the acquired image is extracted and converted to grayscale, and a regional grayscale cumulative distribution curve is obtained; counting the peak points of this curve identifies the plug tray specification. Second, a transfer learning method based on convolutional neural networks is used to construct the seedling classification model; according to the growth characteristics of the seedlings, 2286 seedling samples at the two-leaf-and-one-heart stage were collected to train it. Finally, the region-of-interest image is divided into cell images according to the tray specification, and the cell images are passed to the classification model, which labels each cell as a qualified seedling, an unqualified seedling, or an empty cell. In tests, the tray specification identification method achieved an average accuracy of 100% for the three specifications (50, 72, and 105 cells) on 20-day and 25-day pepper seedlings. Classification models based on transfer learning with four convolutional neural networks (AlexNet, Inception-v3, ResNet-18, VGG16) were constructed and tested: the VGG16-based model achieves the best accuracy (95.50%), while the AlexNet-based model has the shortest training time (6 min 8 s). This research provides a theoretical reference for intelligent replanting classification.
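The tray-specification step boils down to counting peaks along a grayscale profile of the region of interest and mapping the count to a tray layout. A rough sketch with SciPy; the exact curve, peak rule, and mapping are assumptions:

import numpy as np
from scipy.signal import find_peaks

SPEC_BY_PEAKS = {5: 50, 6: 72, 7: 105}   # assumed peak-count -> tray mapping

def tray_spec(gray_roi):
    profile = gray_roi.mean(axis=0)               # column-wise mean intensity
    peaks, _ = find_peaks(profile, distance=20)   # ridges between tray cells
    return SPEC_BY_PEAKS.get(len(peaks))          # None if layout is unknown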

13 pages, 4571 KiB  
Article
Multi-Task Learning Using Task Dependencies for Face Attributes Prediction
by Di Fan, Hyunwoo Kim, Junmo Kim, Yunhui Liu and Qiang Huang
Appl. Sci. 2019, 9(12), 2535; https://doi.org/10.3390/app9122535 - 21 Jun 2019
Cited by 1 | Viewed by 5355
Abstract
Face attribute prediction has a growing number of applications in human–computer interaction, face verification, and video surveillance. Various studies show that dependencies exist among face attributes. A multi-task learning architecture can build synergy among correlated tasks through parameter sharing in the shared layers; however, most such architectures ignore the dependencies between tasks in the task-specific layers. How to further boost the performance of individual tasks by using task dependencies among face attributes is therefore quite challenging. In this paper, we propose a multi-task learning architecture that uses task dependencies for face attribute prediction and evaluate it on smile and gender prediction. Attention modules designed into the task-specific layers of our architecture learn task-dependent disentangled representations. The experimental results demonstrate the effectiveness of the proposed network in comparison with a traditional multi-task learning architecture and state-of-the-art methods on the Faces of the World (FotW) and Labeled Faces in the Wild-a (LFWA) datasets.
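Structurally, the idea is a shared trunk with per-task heads, each gated by its own small attention module over the shared representation. A minimal sketch for the smile and gender tasks, with assumed feature sizes (not the authors' network):

import torch
import torch.nn as nn

class MultiTaskFace(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.attn = nn.ModuleDict({                 # task-specific attention
            t: nn.Sequential(nn.Linear(256, 256), nn.Sigmoid())
            for t in ("smile", "gender")
        })
        self.heads = nn.ModuleDict({
            t: nn.Linear(256, 2) for t in ("smile", "gender")
        })

    def forward(self, feat):
        shared = self.trunk(feat)
        return {t: self.heads[t](shared * self.attn[t](shared))
                for t in self.heads}

out = MultiTaskFace()(torch.randn(4, 512))   # {'smile': (4, 2), 'gender': (4, 2)}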

Review

16 pages, 592 KiB  
Review
A Survey of Handwritten Character Recognition with MNIST and EMNIST
by Alejandro Baldominos, Yago Saez and Pedro Isasi
Appl. Sci. 2019, 9(15), 3169; https://doi.org/10.3390/app9153169 - 4 Aug 2019
Cited by 141 | Viewed by 24733
Abstract
This paper summarizes the top state-of-the-art contributions reported on the MNIST dataset for handwritten digit recognition. The dataset has been extensively used to validate novel techniques in computer vision, and in recent years many authors have explored the performance of convolutional neural networks (CNNs) and other deep learning techniques on it. To the best of our knowledge, this paper is the first exhaustive and updated review of this dataset; there are some online rankings, but they are outdated, and most published papers survey only closely related works, omitting most of the literature. This paper distinguishes between works using some kind of data augmentation and works using the original dataset out-of-the-box, and reports works using CNNs separately, as they are becoming the state-of-the-art approach for this problem. Nowadays, a significant number of works have attained a test error rate below 1% on this dataset, which is thus becoming non-challenging. In mid-2017 a new dataset was introduced, EMNIST, which involves both digits and letters, with a larger amount of data acquired from a database different from MNIST's. This paper also explains EMNIST and surveys some results on it.
