Electronics
  • Article
  • Open Access

2 December 2021

Small-Scale Depthwise Separable Convolutional Neural Networks for Bacteria Classification

Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo 182-8585, Japan
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Electronic Solutions for Artificial Intelligence Healthcare Volume II

Abstract

Bacterial recognition and classification play a vital role in diagnosing disease by determining the bacteria present in specimens and relating them to the observed symptoms. Artificial intelligence and computer vision are widely applied in the medical domain and enable improved accuracy and reduced time for bacterial recognition and classification, which aids clinical decision making and the choice of proper treatment. This paper provides an approach for the automated classification of 33 bacteria strains from the Digital Images of Bacteria Species (DIBaS) dataset based on a small-scale depthwise separable convolutional neural network. Our five-layer architecture offers significant advantages due to its compact model, low computational cost, and reliable recognition accuracy. The experimental results show that the proposed design reached the highest accuracy of 96.28% on a total of 6600 images and, with only 3.23 million parameters and 40.02 million multiply–accumulate operations (MACs), can be executed on limited-resource devices. The number of parameters in this architecture is seven times smaller than that of the smallest model listed in the literature.

1. Introduction

Artificial intelligence (AI) has progressed swiftly in recent decades, from object recognition and detection algorithms to the remarkable execution capabilities of software and hardware. Image and video classification [,,], natural language processing [], robotics [], and healthcare [,] are just a few of the fields where AI-based solutions have surpassed human accuracy and insights. Applying AI and computer vision to the biomedical sciences has opened up immense potential for exploring different areas and improving existing medical technology, particularly bacterial recognition. These methods automatically enhance the detection and classification of bacteria species, are highly accurate, reduce cost and time, and avoid researchers’ risk of infection.
Deep-learning approaches, especially deep convolutional neural networks (DCNNs), are currently some of the most notable machine-learning algorithms for dealing with complex tasks that only experienced experts could address in the past. In computer vision and image classification applications, DCNNs can obtain higher accuracy and even exceed non-learning algorithms. Their higher accuracy comes from extracting high-level features automatically through statistical learning from a large amount of training data. Statistical learning enables an efficient and well-generalized representation of the input space. However, this capability also requires high computational effort and large memory sizes, and as the network grows, the corresponding computational effort and memory requirements rise with it. Due to constraints on power supply and physical size, such networks are difficult to execute on the limited hardware resources of medical devices. Therefore, structural model size reduction [] and parameter optimization [,] have been proposed to maintain the inference performance of deep neural networks.
Early detection and identification of pathogenic bacteria in food, water, and bodily fluids are essential, yet challenging, owing to sample complexity and the large sample volumes that need to be rapidly screened. Existing screening methods based on plate counting or molecular analysis present tradeoffs in detection time, accuracy/sensitivity, cost, and sample preparation complexity. The standard procedure in bacterial detection begins with the collection of various types of test materials. Next, the clinical materials are handled directly on special media (Gram stain, cultivation on agar medium) for 5–20 min. The material is then incubated under specific conditions: 37 °C, 5–7% CO2 (usually taking 18–24 h). The initial identification of bacteria depends on the assessment of the cell shapes observed under the microscope and on the growth rate, type, shape, color, and smell of the colonies (several minutes to 18–24 h). Such analysis allows the assignment to a bacteria type; however, identifying the species is usually impossible due to their significant similarity. Because of that, further analysis consisting of biochemical tests is necessary (16–24 h). As a result, from culture to species identification, the entire diagnostic process can last 2–3 days.
This paper presents a convolutional-neural-network-based approach for the reliably accurate recognition of bacterial species in high-resolution microscopy images, aimed at rapidly detecting and classifying related species. The proposed method consists of two crucial stages. Firstly, data augmentation techniques are applied to create a new dataset of species derived from the DIBaS dataset, both to exploit the fine features of a large number of bacteria images and to avoid overfitting. Secondly, a small-scale depthwise separable CNN architecture (DS-CNN) for bacteria recognition is built. With three main convolutional layers and an efficient classifier, this model recognizes the bacteria in the images. The recommended detection and classification method was tested using the partial DIBaS dataset, with 33 classes of bacteria, and reached a bacterial strain classification rate of 96.28%. The obtained results keep the stated method competitive with state-of-the-art CNN models while, at the same time, fitting low-energy, low-resource devices. More specifically, our design uses only about three million parameters (equivalent to a small memory footprint), about seven times fewer than the most efficient model in related papers, and its computational complexity expressed via MAC operations is forty million. We emphasize the tradeoff between a modest accuracy reduction (2–3%) and the lowest memory usage and computational complexity.
This paper has the following main contributions:
  • The DS-CNN was exploited to construct a compact network architecture for the automated recognition and classification of 33 bacteria species in the DIBaS dataset with reliable accuracy and less time consumption;
  • As part of our methodology, we incorporated preprocessing and data augmentation strategies to improve the model’s input quality and achieve higher classification accuracy.
We organized the rest of this paper as follows: Section 2 reviews related work on bacterial classification on the DIBaS dataset using convolutional neural networks. Section 3 briefly introduces C-Conv and DS-Conv, as well as the proposed architecture. Section 4 presents the materials and methods used in this study. The experimental setup is described in Section 5. Section 6 presents and discusses the results of classifying the 33 types of bacteria. Section 7 offers a conclusion and recommendations for further work.

3. Overview of the Depthwise Separable Convolutional Neural Network

This section presents the fundamental structures of the standard convolutional block (C-Conv) and the depthwise separable convolutional block (DS-Conv). We then analyze the computational complexity of these two main blocks to demonstrate that DS-Conv is more computationally efficient.

3.1. DS-CNN Layer Primer

3.1.1. Conventional Convolution Block

LeCun et al. [] published the first version of the CNN in 1998, and it has since been widely applied to computer vision (CV) tasks such as image classification, speech recognition, face recognition, and natural language processing. In general, two significant features contributed to the CNN’s success. Firstly, it enhances the recognition rate through its receptive field, similar to human visual cells. Secondly, local connections and weight sharing significantly reduce the number of network parameters and alleviate overfitting compared with a fully connected deep neural network. In this paper, we refer to the CNN described above as the conventional CNN (C-Conv).
As for conventional convolutions, as shown in Figure 1, each input channel requires a convolution operation, and the number of convolution kernels equals the number of output channels. The result of each output channel is the sum over all input channels of the convolutions with the corresponding kernels. Assume that the dimension of the input feature is $D_k \times D_k \times M$, where $D_k$, $D_k$, and $M$ are the width, height, and number of input channels, respectively. Each convolutional layer uses filters of size $D_f \times D_f$ (one channel per filter) with $N$ filters.
Figure 1. The operation of a conventional convolution.
The common sizes of filters in CNNs are $11 \times 11$, $5 \times 5$, and $3 \times 3$. The output is $D_g \times D_g \times N$, where $D_g$, $D_g$, and $N$ are the width, height, and number of output channels, respectively. Let the total number of trainable parameters in a conventional convolution be $P_{CConv}$ (without considering bias) and the number of floating-point calculations be $C_{CConv}$ in a standard convolution process. They may be computed as shown in Equations (1) and (2) below:
$P_{CConv} = D_f^2 \times M \times N$   (1)
$C_{CConv} = D_f^2 \times M \times D_g^2 \times N$   (2)

3.1.2. Depthwise Separable Convolution

Depthwise separable convolution first appeared in L. Sifre’s thesis [] in 2014 and was applied in MobileNet [] and the Xception model [] to replace conventional convolutional layers. DS-Conv is a factorized form of the standard spatial convolution, composed of a depthwise convolution and a $1 \times 1$ convolution (also known as a pointwise convolution). The traditional spatial convolution primarily extracts channelwise features and then combines them to generate new representations; the depthwise and pointwise convolutions accomplish these two steps separately.
This is depicted in Figure 2, in which the size of the input image is $D_k \times D_k \times M$, where $D_k$ is the height and width of the input image and $M$ is the number of input channels. Each depthwise convolutional layer uses filters of size $D_f \times D_f \times 1$ with $M$ filters. When the $M$ filters slide over the input, one intermediate feature map of size $D_g \times D_g \times M$ is produced by convolving each input channel with its own 2D filter kernel in the depthwise convolution block; this map serves as the input of the next convolution. For the pointwise convolution, the kernel size is $1 \times 1$, and the number of channels of each kernel must equal the number of input feature map channels. Let the number of convolution kernels be $N$; the output feature map then becomes $D_g \times D_g \times N$ after convolution.
Figure 2. Structure of the depthwise separable convolution block.
As illustrated in Figure 2 for the depthwise separable convolution process, the parameter count $P_{DSConv}$ and the floating-point calculation cost $C_{DSConv}$ are the sums over the depthwise and $1 \times 1$ pointwise convolutions. Hence, $P_{DSConv}$ and $C_{DSConv}$ are calculated as shown in Equations (3) and (4), respectively:
$P_{DSConv} = D_f^2 \times M + M \times N$   (3)
$C_{DSConv} = D_f^2 \times 1 \times D_g^2 \times M + 1^2 \times M \times D_g^2 \times N = (D_f^2 + N) \times D_g^2 \times M$   (4)
Therefore, the parameter ratio $r_1$ between Equations (3) and (1) and the computational cost ratio $r_2$ between Equations (4) and (2), i.e., between the depthwise separable convolution and the normal convolution, can be written as:
$r_1 = \dfrac{P_{DSConv}}{P_{CConv}} = \dfrac{D_f^2 \times M + M \times N}{D_f^2 \times M \times N} = \dfrac{1}{N} + \dfrac{1}{D_f^2} \ll 1$
$r_2 = \dfrac{C_{DSConv}}{C_{CConv}} = \dfrac{(D_f^2 + N) \times D_g^2 \times M}{D_f^2 \times N \times D_g^2 \times M} = \dfrac{1}{N} + \dfrac{1}{D_f^2} \ll 1$
since $N \gg 1$ and $D_f^2 > 1$.
It can clearly be seen that the parameters and computational cost are reduced by a factor of $\frac{1}{N} + \frac{1}{D_f^2}$ compared to the conventional convolution operation. Our study employed a $D_f \times D_f = 3 \times 3$ depthwise convolution filter size and $N = 64$ filters, so the computational complexity and number of parameters of DS-Conv in each neuron are ∼13 times less than those of the same neuron in a conventional convolution, with only a slight accuracy tradeoff for the overall architecture. Further, DS-Conv essentially converts continuous multiplications into continuous additions, so the network’s redundancy is reduced. As a result, the computational efficiency of the network is greatly improved.
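As a concrete illustration of this ratio, the following sketch (not the authors’ code) builds one conventional convolution and one depthwise separable convolution in Keras with the settings used in this study ($D_f = 3$, $N = 64$) and an assumed input of 64 channels, and compares their parameter counts.

```python
# Minimal sketch comparing parameter counts of a conventional convolution and
# a depthwise separable convolution, assuming D_f = 3, N = 64, M = 64 channels.
from tensorflow.keras import layers, Input, Model

inp = Input(shape=(56, 56, 64))  # D_k x D_k x M (spatial size is arbitrary here)

# Conventional convolution: D_f^2 * M * N = 9 * 64 * 64 = 36,864 weights
c_conv = layers.Conv2D(64, 3, padding="same", use_bias=False)(inp)

# Depthwise separable convolution: D_f^2 * M + M * N = 576 + 4,096 = 4,672 weights
dw = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(inp)
ds_conv = layers.Conv2D(64, 1, padding="same", use_bias=False)(dw)

print(Model(inp, c_conv).count_params())   # 36864
print(Model(inp, ds_conv).count_params())  # 4672 -> ratio ~ 1/N + 1/D_f^2 ~ 0.127
```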

3.1.3. Activation Functions

A differentiable, nonlinear function is applied to the feature map, and the result is then passed to the subsequent layer as its input. This function is called the activation function. Activation provides nonlinearity to the network and helps it learn high-order polynomials so that it can perform more complex tasks. There are various types of activation functions; the most commonly used are the sigmoid and the rectified linear unit (ReLU):
  • Sigmoid function:
    $\mathrm{Sigmoid}(x) = \dfrac{e^{x}}{1 + e^{x}}$
The sigmoid function is one of the most typical nonlinear activation functions, with an overall S-shape. It maps a real number to $[0, 1]$ and is often used for binary classification. Besides advantages such as gradient smoothness and precise predictions, it has some main drawbacks: its outputs are not zero-centered, it suffers from the vanishing gradient problem (weights in lower layers are virtually unchanged), and its computation is costly;
  • Rectified linear unit (ReLU):
    $\mathrm{ReLU}(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases}$
    where $x$ is the input to the neuron.
The ReLU layer is a nonlinear operation performed after every convolutional layer; its output is given by $\max(0, x)$. The purpose of ReLU is to introduce nonlinearity into the CNN after the linear convolution operation, since the network needs to learn from real-world data, which are nonlinear, and to generalize or adapt to a variety of data. Compared to sigmoid functions, rectified linear units support faster and more effective training of deep neural architectures on complex datasets. ReLU has several benefits: the number of active neurons is reduced because the output is zero in the negative domain; it does not saturate; it is highly computationally efficient; it speeds up learning; and it mitigates the vanishing gradient problem. A small numerical sketch of both activations follows.
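For illustration, a minimal NumPy sketch of the two activations defined above (not part of the original study) is shown below.

```python
# Illustrative sketch of the sigmoid and ReLU activations.
import numpy as np

def sigmoid(x):
    # Maps any real input to (0, 1); saturates for large |x| (vanishing gradients).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive inputs; cheap and non-saturating.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))  # approx. [0.119 0.378 0.5   0.622 0.881]
print(relu(x))     # [0.  0.  0.  0.5 2. ]
```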

3.1.4. Batch Normalization

Controlling the input distribution across layers can speed up the training process and improve the accuracy significantly. Accordingly, the distribution of the layer input activations (with statistics $\mu$, $\sigma$) is normalized so that it has a zero mean and a unit standard deviation. As shown in Equation (9), in batch normalization (BN), the normalized value is further scaled and shifted, where the parameters ($\gamma$, $\beta$) are learned during training []. $\varepsilon$ is a small constant added to avoid numerical problems. BN is typically performed between the CONV or FC layer and the nonlinear function:
$y = \dfrac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}}\,\gamma + \beta$   (9)
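As an illustration of Equation (9), the following NumPy sketch (not the authors’ code) normalizes a toy batch and applies assumed values of the learned scale and shift.

```python
# Minimal sketch of the batch-normalization transform in Equation (9);
# gamma and beta are illustrative values standing in for learned parameters.
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit standard deviation
    return gamma * x_hat + beta            # learned scale and shift

batch = np.random.randn(32, 64) * 5.0 + 3.0  # toy batch of 32 activation vectors
out = batch_norm(batch)
print(out.mean(), out.std())                 # approximately 0 and 1
```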

3.1.5. Pooling Layer

A key component of convolutional neural networks is the pooling layer, typically applied after the convolutional layers. Pooling layers (also called subsampling or downsampling) reduce each feature map’s dimensionality while retaining the essential information. For the group of neurons in each receptive field, pooling returns a single value that summarizes the group, e.g., the maximum or the average value. The well-known pooling methods are of three types: max, average, and sum. In practice, max pooling is the most widely used and usually works best.

3.1.6. Fully Connected Layer

The output of the convolution/pooling process is flattened and transformed into a single vector of values, each representing the probability that certain features and labels are related. Using the features derived from the previous layers, the fully connected (FC) layer is employed to map the images to labels. Each neuron prioritizes the label corresponding to its received weight, and all neurons then vote on which class should win the classification.

3.1.7. Dropout

Multiple hidden layers are used to learn more complex features, followed by FC layers for decision making. FC layers are connected to all features and are therefore prone to overfitting. Overfitting occurs when a model is trained to perform so well on the training data that its performance on new data degrades. Inserting a dropout layer [] into the model, where some neurons and their connections are randomly removed from the network during training, helps avoid this problem significantly. The network effectively becomes smaller, and any incoming or outgoing connections to a dropped-out node are also removed.
To avoid overfitting during training, one dropout layer was added to the proposed network, with the dropout rate set to either 0.25 or 0.5.

3.1.8. Classifier Layers

In the last layer, we used the softmax activation function, a popular choice for the final layer in most state-of-the-art deep-learning architectures, to normalize the output into a probability distribution over the predicted output classes that sums to one. This function is a generalization of the logistic function to multiple dimensions, and its role is to normalize the outputs to values between zero and one. It is frequently employed in both binary and multi-class tasks, provided that each object belongs to exactly one class. Equation (10) computes the softmax function:
$\mathrm{Softmax}(x_i) = \dfrac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}, \quad i = 1, 2, \ldots, N$   (10)
where $x_i$ are the elements of the input vector to the softmax function, $e^{x_i}$ is the standard exponential function applied to each element of the input vector, and $\sum_{j=1}^{N} e^{x_j}$ is the normalization term, which ensures that all the output values of the function sum to one and each lies in the range $(0, 1)$.
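The following NumPy sketch (illustration only, not from the paper) evaluates Equation (10) on a small vector of logits and confirms that the outputs sum to one.

```python
# Illustrative sketch of the softmax function in Equation (10).
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # approx. [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```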
We designed a dedicated classifier at the end of the network. This classifier consists of one FC layer, one dropout layer with a rate of 0.25 (which randomly cuts off some connections to reduce overfitting), and one softmax layer for the image classification task.

3.1.9. Learning Rate and Optimizers

The learning rate is an essential component for training a CNN: it is the step size used during training and controls how fast training proceeds. However, choosing an appropriate value for the learning rate is important. If $\eta$ is too large, the network may start diverging instead of converging; on the other hand, a smaller $\eta$ results in longer convergence times, and the network may become stuck in local minima. A popular way to address this problem is to reduce the learning rate during training. In this article, we set the learning rate to $\eta = 1 \times 10^{-4}$.
In convolutional neural networks, non-convex functions often need to be optimized. Exact mathematical methods require massive computing power, so optimizers are used in the training process to minimize the loss function and acquire optimal network parameters within an acceptable time. Standard optimization algorithms, including RMSprop, Adam, Adamax, and Nadam, were employed for our model. RMSprop considers only the magnitude of the gradient of the immediately previous iteration. The Adam approach uses both momentum and the gradient magnitude to calculate an adaptive learning rate, similar to RMSprop; it improves overall accuracy and enables efficient training with better convergence of deep-learning algorithms. Each of these solutions has advantages as well as drawbacks, so each optimizer was verified in the experimental scenarios.
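As an illustration, the sketch below shows how such an optimizer comparison might be set up in Keras with the learning rate used in this paper ($1 \times 10^{-4}$); `model` stands for the DS-CNN built elsewhere, and the loop itself is an assumption, not the authors’ exact procedure.

```python
# Hedged sketch: compiling the (assumed) DS-CNN `model` with each candidate
# optimizer at the paper's learning rate of 1e-4.
from tensorflow.keras import optimizers

candidates = {
    "RMSprop": optimizers.RMSprop(learning_rate=1e-4),
    "Adam":    optimizers.Adam(learning_rate=1e-4),
    "Adamax":  optimizers.Adamax(learning_rate=1e-4),
    "Nadam":   optimizers.Nadam(learning_rate=1e-4),
}

for name, opt in candidates.items():
    model.compile(optimizer=opt,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(...) would be run once per optimizer and the results compared.
```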

3.2. The Proposed Architecture

The proposed block diagram of the bacteria classification process is described in Figure 3. Our architecture consists of four main blocks, from input microscopy images to the classification process, in which the second and third blocks are the main contributions. The data preprocessing block enhances the size and quality of the data and addresses dataset imbalance and overfitting issues. This block is composed of three stages. Firstly, the input images were resized from the original dimensions to $224 \times 224$, and the number of channels was set to three to fit the expected input of the model. Then, data augmentation was applied by using transformations to artificially increase the number of images for the training process. Finally, the new dataset was split into training, validation, and testing sets with percentages of 70/20/10, respectively.
Figure 3. Schematic overview of the network architecture.
The schematic overview of the proposed model is depicted with five crucial layers, in which the first three convolutional layers extract image features with a depth of sixty-four and filter sizes of $3 \times 3$ and $1 \times 1$. After that, one FC layer and one softmax function are used for data flattening and bacteria species classification, respectively. In addition, we inserted BN and dropout layers to normalize the data and avoid overfitting during training.
Figure 4a,b illustrates the internal structure of C-Conv and DS-Conv in detail, respectively. From Figure 4a, C-Conv carries out only a Conv2D operation, then passes through the BN layer, and finally enters the ReLU activation layer to obtain the output of C-Conv. Similarly, in Figure 4b, DS-Conv first carries out the DW-Conv operation and then passes through a ReLU layer to obtain the output of DW-Conv. The output of the DW-Conv layer is then input into the PW-Conv layer, which performs a pointwise convolution with a $1 \times 1$ filter size and passes through a ReLU activation layer; its output is the output of DS-Conv. Here, BN after C-Conv helps accelerate deep network training by reducing the internal covariate shift [] and normalizing the input data $x$ toward a standard normal distribution. In addition, ReLU was utilized as the activation function.
Figure 4. Internal structure. (a) C-Conv; (b) DS-Conv.
Table 1 presents the small-scale CNN architecture based on depthwise separable convolution (DS-Conv) and its detailed specification. Following the conventional convolutional layer (C-Conv), the BN layer, and the max pooling layer, the depthwise convolution (DW-Conv), pointwise convolution (PW-Conv), and max pooling (Max_pool) layers are added. After two consecutive depthwise separable and max pooling blocks, an FC layer and the softmax classifier produce the 33 outputs. We also calculated the trainable parameters (weights and biases) and the computation cost (multiply–accumulate operations, MACs) of the design.
Table 1. The proposed architecture specification.
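To make the layer sequence concrete, the following Keras sketch reproduces the overall structure described above (one C-Conv block, two DS-Conv blocks, and the classifier). The pooling sizes, FC arrangement, and exact layer dimensions are illustrative assumptions, since Table 1 is not reproduced here, so the parameter count will not match the reported 3.23 M exactly.

```python
# Hedged sketch of the five-layer DS-CNN described in Section 3.2 (assumed dims).
from tensorflow.keras import layers, Sequential, Input

def ds_block():
    # Depthwise separable block: DW-Conv -> ReLU -> PW-Conv (1x1) -> ReLU -> pooling
    return [
        layers.DepthwiseConv2D(3, padding="same"),
        layers.ReLU(),
        layers.Conv2D(64, 1, padding="same"),
        layers.ReLU(),
        layers.MaxPooling2D(2),
    ]

model = Sequential([
    Input(shape=(224, 224, 3)),
    # C-Conv block: Conv2D -> BN -> ReLU -> max pooling
    layers.Conv2D(64, 3, padding="same"),
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.MaxPooling2D(2),
    *ds_block(),
    *ds_block(),
    # Classifier (Section 3.1.8): flatten -> dropout (0.25) -> FC with softmax
    layers.Flatten(),
    layers.Dropout(0.25),
    layers.Dense(33, activation="softmax"),
])
model.summary()
```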

4. Materials and Methods

4.1. Dataset

The Digital Images of Bacteria Species dataset [] consists of 33 bacteria species, each with a total of 20 images. The Chair of Microbiology at the Jagiellonian University in Krakow, Poland, was in charge of collecting the samples. Figure 5 shows some of the images in this dataset, which have original dimensions of $2048 \times 1532 \times 3$. All samples were stained using Gram’s method. The images were collected with an Olympus CX31 Upright Biological Microscope paired with an SC30 camera (Olympus Corporation, Japan) and evaluated under a 100-fold objective with oil immersion (Nikon50, Shinagawa Intercity Tower C, 2-15-3, Konan, Minato-ku, Tokyo 108-6290, Japan). Researchers interested in the bacterial colony domain can use the DIBaS dataset on a public-access basis.
Figure 5. Bacteria samples in the DIBaS dataset [].

4.2. Dataset Augmentation

Deep CNNs are particularly dependent on the availability of large quantities of training data. However, due to the small number of images available in the biomedical domain, it is hard to meet the massive input data requirements of CNNs, and using a small amount of data leads to overfitting. A common solution to alleviate the relative scarcity of data compared to the number of parameters in CNNs is data augmentation []. Data augmentation transforms the available data into new data without altering its nature: the computer treats the modified image as a different sample, while a human still recognizes it as the same picture. Simple geometric transformations such as sampling [], mirroring [], rotating [], shifting [], and various photometric transformations [] are popular augmentation methods.
The number of images in the dataset was significantly increased by using transformations that do not change the classes. Each image was transformed according to the steps listed below (a hedged Keras sketch of these settings follows the list):
  • rotation_range is a value in the range of 0° to 180° within which to rotate pictures randomly; 40° was the value selected;
  • width_shift = 0.2 and height_shift = 0.2 are thresholds (as a fraction of the total width or height) within which to randomly shift images vertically or horizontally;
  • shear_range is for randomly applying shearing transformations; it was set to 0.2;
  • zoom_range = 0.2 is for randomly zooming the pictures;
  • horizontal_flip is for randomly reversing the pixels of the image horizontally;
  • fill_mode = reflect is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.
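A hedged sketch of these augmentation settings, assuming Keras’s ImageDataGenerator, is given below; the directory layout and batch size are illustrative assumptions.

```python
# Sketch of the augmentation settings listed above using Keras's ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=40,        # random rotations within +/- 40 degrees
    width_shift_range=0.2,    # horizontal shift, as a fraction of image width
    height_shift_range=0.2,   # vertical shift, as a fraction of image height
    shear_range=0.2,          # random shearing transformations
    zoom_range=0.2,           # random zoom in/out
    horizontal_flip=True,     # random horizontal mirroring
    fill_mode="reflect",      # strategy for filling newly created pixels
)

# Example usage: stream augmented 224x224 images from per-class folders
# (the directory name is a placeholder, not the authors' layout).
train_flow = augmenter.flow_from_directory(
    "dibas_augmented/train", target_size=(224, 224), batch_size=32)
```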
In addition, some CV functions were exploited for the image augmentation, to open the files in the class folders, and to save the results as .tif files.
As a result, we generated 6600 images from the original samples and then utilized histogram equalization techniques to check and retain the valuable information in the new samples.

5. Experimental Setups

5.1. Training Strategies

Our model was trained and tested on a 64-bit Windows 10 computer with an Intel Core i9 processor (3.6 GHz), 32 GB of RAM, and an NVIDIA GeForce RTX 2080 SUPER graphics card. The implementation was written in Python on top of the TensorFlow framework [] and the Keras libraries [].
Following preprocessing, the new dataset contained a total of 6600 bacteria images, of which 70% was allocated to the training set, 20% to the validation set, and the remainder to the test set. The learning rate hyperparameter, which controls the speed of weight updates, was set to $\eta = 1 \times 10^{-4}$, and the weights of the filters were randomly initialized and automatically updated. The experimental setup for the training process is given in Figure 6. The suggested architecture was trained from scratch.
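The split and the training call might look as follows; `images`, `labels`, `model`, and the epoch/batch-size values are placeholders rather than the authors’ exact configuration, and `model` is assumed to be compiled as in the optimizer sketch above.

```python
# Hedged sketch of the 70/20/10 split and the training call; `images` is a
# NumPy array of resized images and `labels` a one-hot label matrix (assumed).
from sklearn.model_selection import train_test_split

x_train, x_tmp, y_train, y_tmp = train_test_split(
    images, labels, train_size=0.70,
    stratify=labels.argmax(axis=1), random_state=42)
x_val, x_test, y_val, y_test = train_test_split(
    x_tmp, y_tmp, train_size=2 / 3,          # 20% validation, 10% test overall
    stratify=y_tmp.argmax(axis=1), random_state=42)

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=50, batch_size=32)  # epochs/batch size are assumptions
```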
Figure 6. Research methodology.

5.2. Parameter Selection

The fundamental requirement for proper neural network training is the correct selection of the hyperparameters. For this, we employed grid search optimization on the training set with five-fold cross-validation (Figure 7). We verified several activation functions, including ReLU and sigmoid, for the classifier’s fully (densely) connected layer. The dropout rate, which indicates how many input units are dropped, was varied between 0.25 and 0.5. Table 2 lists the parameters that produced the most promising results.
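A sketch of such a grid search with five-fold cross-validation is shown below, assuming a hypothetical `build_model(activation, dropout_rate)` factory that returns a compiled instance of the proposed DS-CNN; the loop is illustrative, not the authors’ exact procedure.

```python
# Hedged sketch: grid search over activation and dropout rate with 5-fold CV.
import numpy as np
from sklearn.model_selection import KFold

param_grid = [(act, rate) for act in ("relu", "sigmoid") for rate in (0.25, 0.5)]
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

best = None
for activation, dropout_rate in param_grid:
    scores = []
    for train_idx, val_idx in kfold.split(x_train):
        model = build_model(activation=activation, dropout_rate=dropout_rate)
        model.fit(x_train[train_idx], y_train[train_idx], epochs=20, verbose=0)
        _, acc = model.evaluate(x_train[val_idx], y_train[val_idx], verbose=0)
        scores.append(acc)
    mean_acc = np.mean(scores)
    if best is None or mean_acc > best[0]:
        best = (mean_acc, activation, dropout_rate)

print(best)  # (mean accuracy, best activation, best dropout rate)
```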
Figure 7. Five-fold cross-validation scheme.
Table 2. Hyperparameters and training settings for the suggested models.

6. Results and Discussion

This section consists of two subsections. In the first, we list the evaluation metrics for the proposed model and present a confusion matrix of the results on the test set. In the second, we compare the performance of our method with that of other studies on the DIBaS dataset.

6.1. Statistical Analysis

The evaluation metric plays an essential role in achieving the optimal classifier during classification training. Thus, choosing proper evaluation metrics is key to discriminating between models and obtaining the expected results. A model’s performance is often measured using accuracy, precision, sensitivity, specificity, and F1, obtained from the confusion matrix (CM). These metrics, together with the receiver operating characteristic (ROC) curve, the area under the curve (AUC), and the precision–recall (PR) curve, are widely used as verification standards in machine learning and have been successfully applied in many biomedical studies [,,,]. ROC, AUC, and PR, however, are better suited to binary classification tasks or a small number of outputs and may face significant challenges when describing many classes. Therefore, in this article, we chose the CM, as well as the accuracy, precision, and F1, as the evaluation metrics.
The values and terms in the CM are shown in Table 3. TP and TN are correctly classified positive or negative data; FP is negative data classified as positive; FN is positive data classified as negative. After obtaining the CM values, we calculated the accuracy, precision, sensitivity, specificity, and F1 indices defined below to evaluate the stated model for the scenarios in Table 2 (a hedged scikit-learn sketch follows the list).
Table 3. Confusion matrix values.
  • Accuracy is the most straightforward metric used to quantify a model’s performance. It is the fraction of correct predictions over the total number of predictions:
    $\mathrm{Accuracy} = \dfrac{\text{Number of correct predictions}}{\text{Total number of predictions}};$
  • Precision is the fraction of true positive predictions over the total number of positive predictions. Precision refers to how often we are correct when the predicted value is positive:
    $\mathrm{Precision} = \dfrac{TP}{TP + FP};$
  • Sensitivity is the ratio of true positive predictions to the total number of actual positives. Sensitivity is also referred to as the recall or true positive rate; it expresses how often the forecast is correct when the real value is positive:
    $\mathrm{Sensitivity} = \dfrac{TP}{TP + FN};$
  • Specificity is calculated by dividing the number of correctly predicted negatives (true negatives) by the total number of actual negatives. Specificity is also known as the true negative rate; it expresses how often a prediction is correct when the actual value is negative:
    $\mathrm{Specificity} = \dfrac{TN}{TN + FP};$
  • The $F_1$-score, alternatively called the balanced F-score or F-measure, can be calculated as a weighted average of the precision and sensitivity:
    $F_1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}.$
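The sketch below (referenced above) shows how these metrics can be computed with scikit-learn from test-set predictions; `y_true` and `y_pred` are assumed to be integer class labels rather than the authors’ actual arrays.

```python
# Hedged sketch: computing the metrics defined above from test-set predictions.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

cm = confusion_matrix(y_true, y_pred)                         # 33x33 confusion matrix
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="macro")  # macro average over classes
sensitivity = recall_score(y_true, y_pred, average="macro")   # recall = sensitivity
f1 = f1_score(y_true, y_pred, average="macro")

# Specificity has no direct scikit-learn helper; per class: TN / (TN + FP).
tn = cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + np.diag(cm)
fp = cm.sum(axis=0) - np.diag(cm)
specificity = (tn / (tn + fp)).mean()
```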
The detailed performance analysis of the confusion matrix for Fold 5, obtained using the validation data, is given in Figure 8. The confusion matrix shows that almost all the bacteria images were classified correctly by the recommended model, except for some images of Enterococcus faecalis, Enterococcus faecium, and Bacteroides fragilis. The misclassified bacteria images were predicted as Staphylococcus saprophyticus and Staphylococcus aureus, respectively. This mistaken classification can be explained by the relatively similar color and shape of these bacteria samples and by insufficient training images for the model to extract and classify their features.
Figure 8. Bacteria classification confusion matrix using 5-fold cross-validation.
Table 4 compares the performance of the different optimizers. The most effective choices with data augmentation were the recently popular Nadam and RMSprop, which achieved a maximum accuracy of 96.28%, as well as high precision and sensitivity. Without data augmentation, the optimizers performed significantly worse in terms of accuracy, with Adamax and Adam reaching 86.42% and 86.24% accuracy, respectively.
Table 4. Model performance (%).

6.2. Comparison with Other Studies

Figure 9 presents the recent state-of-the-art results conducted on DIBaS image classification.
Figure 9. Recent state-of-the-art studies on DIBaS image classification.
Table 5 compares the results of studies deployed on the full DIBaS dataset in terms of model structure, number of layers, number of parameters, classification accuracy, and data preprocessing methods. As observed from the figure and table, the proposed architecture consumes the fewest resources at a slightly lower accuracy level compared to the other studies. The analysis suggests that ResNet-50 without data augmentation had the highest accuracy, while using a medium number of parameters. Nasip et al. and Khalifa et al. reported the same accuracy (∼98.2%); however, Nasip used many more parameters and input images than Khalifa to obtain this result, indicating that Khalifa’s method is more efficient. Our work employed C-Conv and DS-Conv combinations to obtain about the same performance (about 96.3%) as Zielinski’s work and slightly lower (∼3%) than the other works.
Table 5. Comparative analysis among the studies conducted on the DIBaS dataset.
On the other hand, the number of parameters (3.23 M) and the model size (five layers) of our design were the smallest, a result of the optimized convolutional computation and effective data augmentation. This approach could pave the way for deep-learning algorithms to be integrated directly into low-resource devices in biomedical fields.

7. Conclusions and Future Work

The experimental results demonstrated that the recommended five-layer DS-CNN architecture has broad potential to be deployed in medical imaging analysis tasks, especially for small datasets. CNN variants can further improve medical imaging technology and strengthen its capabilities, providing a higher level of automation while speeding up processes and increasing productivity. This study achieved a 96.28% accuracy in bacterial strain classification while using few trainable parameters (3.23 M) and low computational complexity (40.02 M MACs). This leads to lower energy consumption, with only a slight accuracy tradeoff (∼3%), making the design fit limited-resource devices.
The approach is limited by the need for manually labeled data, by the large data and training time requirements, and occasionally by incorrect predictions between similar bacteria species. The network may also inherit faults from a specialist, as the correct judgment of a cell is, in many cases, difficult even for an experienced human. A more extensive dataset labeled by a larger group of experts is one way to mitigate this limitation. Future research will mainly focus on expanding and processing the dataset [], optimizing the key blocks (DS-Conv) [], and fine-tuning the last layers. We also plan to implement our dedicated architecture on a hardware platform to exploit robust parallel computation and low energy consumption.

Author Contributions

Conceptualization, K.I. and D.-T.M.; Methodology, D.-T.M. and K.I.; Analysis and interpretation, D.-T.M. and K.I.; Investigation, all of the authors; Writing original draft preparation, D.-T.M.; Supervision, K.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial intelligence
AUC: Area under the curve
BN: Batch normalization
C-CNNs: Conventional CNNs
CM: Confusion matrix
CV: Computer vision
DS-CNNs: Depthwise separable CNNs
DCNNs: Deep convolutional neural networks
DIBaS: Digital Images of Bacteria Species
FC: Fully connected
LSTM: Long short-term memory
MAC: Multiply–accumulate
GPU: Graphics processing unit
PR curve: Precision–recall curve
ReLU: Rectified linear unit
RF: Random forest
ROC: Receiver operating characteristic

References

  1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef] [Green Version]
  2. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  3. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos. arXiv 2014, arXiv:1406.2199. [Google Scholar]
  4. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural Language Processing (Almost) from Scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
  5. Zhang, T.; Kahn, G.; Levine, S.; Abbeel, P. Learning deep control policies for autonomous aerial vehicles with MPC-guided policy search. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 June 2016; IEEE: Stockholm, Sweden, 2016; pp. 528–535. [Google Scholar] [CrossRef] [Green Version]
  6. Zhou, J.; Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 2015, 12, 931–934. [Google Scholar] [CrossRef] [Green Version]
  7. Alipanahi, B.; Delong, A.; Weirauch, M.T.; Frey, B.J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015, 33, 831–838. [Google Scholar] [CrossRef] [PubMed]
  8. Zhuang, B.; Shen, C.; Tan, M.; Liu, L.; Reid, I. Structured Binary Neural Networks for Accurate Image Classification and Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 413–422. [Google Scholar] [CrossRef] [Green Version]
  9. He, T.; Zhang, Z.; Zhang, H.; Zhang, Z.; Xie, J.; Li, M. Bag of Tricks for Image Classification with Convolutional Neural Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 558–567. [Google Scholar] [CrossRef] [Green Version]
  10. Nuriel, O.; Benaim, S.; Wolf, L. Permuted AdaIN: Reducing the Bias towards Global Statistics in Image Classification. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021. [Google Scholar]
  11. Nie, D.; Shank, E.A.; Jojic, V. A deep framework for bacterial image segmentation and classification. In Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics, Atlanta, Georgia, 9–12 September 2015; pp. 306–314. [Google Scholar] [CrossRef]
  12. Ates, H.; Gerek, O.N. An image-processing based automated bacteria colony counter. In Proceedings of the 2009 24th International Symposium on Computer and Information Sciences, Guzelyurt, Cyprus, 14–16 September 2009; pp. 18–23. [Google Scholar] [CrossRef]
  13. Divya, S.; Dhivya, A. Human Eye Pupil Detection Technique Using Circular Hough Transform. Int. J. Adv. Res. Innov. 2019, 7, 3. [Google Scholar]
  14. Limare, N.; Lisani, J.L.; Morel, J.M.; Petro, A.B.; Sbert, C. Simplest Color Balance. Image Process. Line 2011, 1, 297–315. [Google Scholar] [CrossRef] [Green Version]
  15. Ganesan, P.; Sajiv, G. A comprehensive study of edge detection for image processing applications. In Proceedings of the 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS), Coimbatore, India, 17–18 March 2017; pp. 1–6. [Google Scholar] [CrossRef]
  16. Xuan, L.; Hong, Z. An improved canny edge detection algorithm. In Proceedings of the 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 24–26 November 2017; pp. 275–278. [Google Scholar] [CrossRef]
  17. Hiremath, P.S.; Bannigidad, P. Automatic Classification of Bacterial Cells in Digital Microscopic Images. In Proceedings of the Second International Conference on Digital Image Processing, Singapore, 26–28 February 2010; 2010; p. 754613. [Google Scholar] [CrossRef]
  18. Ho, C.S.; Jean, N.; Hogan, C.A.; Blackmon, L.; Jeffrey, S.S.; Holodniy, M.; Banaei, N.; Saleh, A.A.E.; Ermon, S.; Dionne, J. Rapid identification of pathogenic bacteria using Raman spectroscopy and deep learning. Nat. Commun. 2019, 10, 4927. [Google Scholar] [CrossRef]
  19. Kang, R.; Park, B.; Eady, M.; Ouyang, Q.; Chen, K. Single-cell classification of foodborne pathogens using hyperspectral microscope imaging coupled with deep learning frameworks. Sens. Actuators B Chem. 2020, 309, 127789. [Google Scholar] [CrossRef]
  20. Sajedi, H.; Mohammadipanah, F.; Pashaei, A. Image-processing based taxonomy analysis of bacterial macromorphology using machine-learning models. Multimed. Tools Appl. 2020, 79, 32711–32730. [Google Scholar] [CrossRef]
  21. Tamiev, D.; Furman, P.E.; Reuel, N.F. Automated classification of bacterial cell sub-populations with convolutional neural networks. PLoS ONE 2020, 15, e0241200. [Google Scholar] [CrossRef]
  22. Mhathesh, T.S.R.; Andrew, J.; Martin Sagayam, K.; Henesey, L. A 3D Convolutional Neural Network for Bacterial Image Classification. In Intelligence in Big Data Technologies—Beyond the Hype; Peter, J.D., Fernandes, S.L., Alavi, A.H., Eds.; Springer: Singapore, 2021; Volume 1167, pp. 419–431. [Google Scholar] [CrossRef]
  23. Korzeniewska, E.; Szczęsny, A.; Lipiński, P.; Dróżdż, T.; Kiełbasa, P.; Miernik, A. Prototype of a Textronic Sensor Created with a Physical Vacuum Deposition Process for Staphylococcus aureus Detection. Sensors 2020, 21, 183. [Google Scholar] [CrossRef]
  24. Zieliński, B.; Plichta, A.; Misztal, K.; Spurek, P.; Brzychczy-Włoch, M.; Ochońska, D. Deep learning approach to bacterial colony classification. PLoS ONE 2017, 12, e0184554. [Google Scholar] [CrossRef] [Green Version]
  25. Nasip, O.F.; Zengin, K. Deep Learning Based Bacteria Classification. In Proceedings of the 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 19–21 October 2018; pp. 1–5. [Google Scholar] [CrossRef]
  26. Talo, M. An Automated Deep Learning Approach for Bacterial Image Classification. In Proceedings of the International Conference on Advanced Technologies, Computer Engineering and Science (ICATCES), Karabuk, Turkey, 11–13 May 2019. [Google Scholar]
  27. Transfer Learning, CS231n Convolutional Neural Networks for Visual Recognition. Available online: https://cs231n.github.io/transfer-learning/ (accessed on 24 November 2021).
  28. Khalifa, N.E.M.; Taha, M.H.N.; Hassanien, A.E. Deep bacteria: Robust deep learning data augmentation design for limited bacterial colony dataset. Int. J. Reason.-Based Intell. Syst. 2019, 11, 9. [Google Scholar] [CrossRef]
  29. Plichta, A. Recognition of species and genera of bacteria by means of the product of weights of the classifiers. Int. J. Appl. Math. Comput. Sci. 2020. [Google Scholar] [CrossRef]
  30. Plichta, A. Methods of Classification of the Genera and Species of Bacteria Using Decision Tree. J. Telecommun. Inf. Technol. 2020, 4, 74–82. [Google Scholar] [CrossRef]
  31. Patel, S. Bacterial Colony Classification Using Atrous Convolution with Transfer Learning. Ann. Rom. Soc. Cell Biol. 2021, 25, 1428–1441. [Google Scholar]
  32. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
  33. Sifre, L. Rigid-Motion Scattering for Image Classification. Ph.D. Thesis, Ecole Polytechnique, CMAP, Palaiseau, France, 2014. [Google Scholar]
  34. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  35. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 22–25 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef] [Green Version]
  36. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 37, pp. 448–456. [Google Scholar]
  37. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  38. DIBaS—Krzysztof Paweł Misztal. Available online: http://misztal.edu.pl/software/databases/dibas/ (accessed on 24 November 2021).
  39. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. arXiv 2015, arXiv:1409.0575. [Google Scholar] [CrossRef] [Green Version]
  40. Yang, H.; Patras, I. Mirror, mirror on the wall, tell me, is the error small? In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4685–4693. [Google Scholar] [CrossRef] [Green Version]
  41. Xie, S.; Tu, Z. Holistically-Nested Edge Detection. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1395–1403. [Google Scholar] [CrossRef] [Green Version]
  42. Salamon, J.; Bello, J.P. Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification. IEEE Signal Process. Lett. 2017, 24, 279–283. [Google Scholar] [CrossRef]
  43. Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture. arXiv 2015, arXiv:1411.4734. [Google Scholar]
  44. Tutorials|TensorFlow Core. Available online: https://www.tensorflow.org/tutorials (accessed on 24 November 2021).
  45. Keras documentation: Keras API reference. Available online: https://keras.io/api/ (accessed on 24 November 2021).
  46. Le, N.Q.K.; Kha, Q.H.; Nguyen, V.H.; Chen, Y.C.; Cheng, S.J.; Chen, C.Y. Machine Learning-Based Radiomics Signatures for EGFR and KRAS Mutations Prediction in Non-Small-Cell Lung Cancer. Int. J. Mol. Sci. 2021, 22, 9254. [Google Scholar] [CrossRef]
  47. Le, N.Q.K.; Hung, T.N.K.; Do, D.T.; Lam, L.H.T.; Dang, L.H.; Huynh, T.T. Radiomics-based machine learning model for efficiently classifying transcriptome subtypes in glioblastoma patients from MRI. Comput. Biol. Med. 2021, 132, 104320. [Google Scholar] [CrossRef] [PubMed]
  48. Schneider, P.; Müller, D.; Kramer, F. Classification of Viral Pneumonia X-ray Images with the Aucmedi Framework. arXiv 2021, arXiv:2110.01017. [Google Scholar]
  49. Xie, Z.; Deng, X.; Shu, K. Prediction of Protein–Protein Interaction Sites Using Convolutional Neural Network and Improved Data Sets. Int. J. Mol. Sci. 2020, 21, 467. [Google Scholar] [CrossRef] [Green Version]
  50. Pei, Y.; Huang, Y.; Zou, Q.; Zhang, X.; Wang, S. Effects of Image Degradation and Degradation Removal to CNN-Based Image Classification. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1239–1253. [Google Scholar] [CrossRef] [PubMed]
  51. Chen, W.; Xie, D.; Zhang, Y.; Pu, S. All You Need Is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7234–7243. [Google Scholar] [CrossRef] [Green Version]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
