SP-ILC: Concurrent Single-Pixel Imaging, Object Location, and Classification by Deep Learning

We propose a concurrent single-pixel imaging, object location, and classification scheme based on deep learning (SP-ILC). We used multitask learning, developed a new loss function, and created a dataset suitable for this project. The dataset consists of scenes that contain different numbers of possibly overlapping objects of various sizes. The results show that SP-ILC concurrently locates objects in a scene with a high degree of precision, produces high-quality single-pixel images of the objects, and accurately classifies the objects, all at a low sampling rate. SP-ILC has potential for effective use in remote sensing, medical diagnosis and treatment, security, and autonomous vehicle control.


Introduction
Single-pixel imaging (SPI) uses a single-pixel detector instead of an array detector to obtain information about an object. Random patterns or orthogonal patterns, such as Fourier transform basis patterns and Hadamard transform basis patterns, are used to encode the light field illuminating the object, and the reflected light is collected by a single-pixel detector [1,2]. Single-pixel detectors can operate in spectral bands, such as the THz, infrared, and X-ray bands [3][4][5], that are not accessible to array detectors because of cost or technical constraints. Single-pixel imaging and the related technique of computational ghost imaging have recently attracted widespread attention [6][7][8][9][10][11][12][13]. Single-pixel imaging has important applications in 3D imaging [14,15], LiDAR [16], encrypted communication [17], and many other fields.
In 2017, Lyu et al. [33] proposed a method of computational ghost imaging based on machine learning and were able to reconstruct high-quality images at low sampling rates. In 2018, Jiao [34] used the S-vector of a single-pixel camera for object classification without using images for training or classification. Also in 2018, Higham et al. [35] used deep learning to achieve single-pixel imaging at video rate. An important insight in their work was that the speckle pattern used in single-pixel imaging can be treated as a layer of the deep neural network; that is, the speckle pattern can be incorporated into the training process to create an end-to-end system. In 2019, Wang et al. [36] proposed a method of using simulated data for neural network training. The training of deep neural networks requires a large amount of data, and obtaining training data experimentally is often time-consuming, so the method of Wang et al. greatly improved the efficiency of neural network training. In 2020, Zhang et al. [37] developed a method of classifying moving objects using deep learning.
Great progress has been made in single-pixel imaging and single-pixel classification by combining them with deep learning [38][39][40]. Studies of single-pixel classification have focused on scenes that contain only a single object, with objects in different scenes being similar in size. In practice, however, the number and size of objects in a scene will vary and objects may overlap [41][42][43][44], which raises a fundamental question: Can a single-pixel imaging system correctly locate and classify multiple objects in a scene? We think it can. Although single-pixel imaging and classification are now treated as separate activities, an end-to-end unified scheme has many benefits; for example, a deep neural network based on multitask learning can internally share feature maps between different tasks, which reduces the computer time needed for training, makes the network more compact, and increases generalizability.
We propose a single-pixel concurrent imaging, object location, and classification scheme using deep learning. Experiments have shown that SP-ILC can concurrently image, locate, and classify different numbers of objects of different sizes, including objects that overlap. We used multitask learning based on feature multiplexing techniques, constructed a new loss function, and created a dataset suitable for this project. We have succeeded in concurrently producing high-quality single-pixel images, precisely locating objects, and classifying objects with great accuracy at a low sampling rate. SP-ILC may benefit a wide range of applications such as remote sensing and detection, medical treatments, security, and autonomous vehicle control.
SP-ILC offers the following contributions:
1. SP-ILC is the first image recognition system (to the best of our knowledge) to accurately locate and classify individual or multiple different-sized objects in a scene, even when objects overlap, using a single-pixel camera. Current state-of-the-art single-pixel classification systems with deep learning are able to identify and classify only a single object in a scene;
2. SP-ILC is an end-to-end system based on multitask learning that concurrently detects images, locates objects, and classifies objects in a single process. In contrast to techniques that detect images and identify objects in separate processes, SP-ILC has a compact structure, uses shared feature maps, and has increased generalizability;
3. We have made the code and dataset associated with this study available to other researchers as open source [45].

Experimental Setup and Structure of the Deep Neural Network
The experimental configuration is shown in Figure 1a. This is a typical SPI geometry. The beam of the 532 nm laser (F-IVB-500, Yu Guang Co. Ltd.) is expanded by Lens 1 and illuminates a digital micromirror device (DMD, DLP7000, Texas Instruments) with an operating frequency of 20 kHz to produce a speckle pattern. The speckle patterns are coded by M random matrices consisting only of the values 1 or −1, each of which is realized as the difference between two consecutive DMD patterns; this method also reduces noise [35]. The kth pattern is denoted by $I_k(x, y)$ ($k = 1, 2, \ldots, M$), where x, y are spatial coordinates. The two corresponding DMD patterns are $I_k^{+}(x, y) = [I_k(x, y) + 1]/2$ and $I_k^{+}(x, y) - I_k(x, y)$, respectively [46]. Therefore, M patterns are implemented by 2M DMD frames. The object is displayed by the liquid crystal spatial light modulator (SLM, FSLM-HD70-AP, CAS Microstar), and Lens 2 projects the speckle pattern onto the object. The light reflected from the object is received by the bucket detector (DET36A, Thorlabs). The light signal detected by the bucket detector undergoes analog-to-digital (A/DC, ADC11C125, Texas Instruments) conversion to produce the bucket signal $S_k$, which is defined by:

$$S_k = \sum_{x, y} I_k(x, y)\, O(x, y), \tag{1}$$

where $O(x, y)$ is the object, the size of which is 64 × 64 pixels. After all measurements end, the obtained bucket signals form a one-dimensional vector $[S_1, \ldots, S_k, \ldots, S_M]$, which we call the S-vector in the remainder of this paper. The S-vector is input to the neural network for imaging, object location, and classification.
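The differential encoding described above can be illustrated numerically. The following is a minimal sketch, assuming the bucket signal is a plain sum over pixels and using a toy 4 × 4 scene in place of the 64 × 64 object; the function names `split_pattern`, `bucket`, and `measure` are hypothetical and are not taken from the released code [45].

```python
import random

def split_pattern(pattern):
    """Split a +/-1 pattern into two binary DMD frames whose difference is the pattern."""
    # First frame: (I_k + 1) / 2, so +1 -> 1 and -1 -> 0.
    plus = [[(v + 1) // 2 for v in row] for row in pattern]
    # Second frame: first frame minus I_k, so +1 -> 0 and -1 -> 1.
    minus = [[p - v for p, v in zip(prow, row)] for prow, row in zip(plus, pattern)]
    return plus, minus

def bucket(frame, scene):
    """Total intensity reaching the single-pixel detector for one displayed frame."""
    return sum(f * o for frow, orow in zip(frame, scene) for f, o in zip(frow, orow))

def measure(patterns, scene):
    """Return the S-vector: one differential bucket value per +/-1 pattern."""
    s_vector = []
    for pattern in patterns:
        plus, minus = split_pattern(pattern)
        s_vector.append(bucket(plus, scene) - bucket(minus, scene))
    return s_vector

rng = random.Random(0)
N = 4  # toy scene side length (the experiment uses 64)
M = 6  # number of +/-1 patterns, realized as 2 * M binary frames
scene = [[rng.randint(0, 255) for _ in range(N)] for _ in range(N)]
patterns = [[[rng.choice((-1, 1)) for _ in range(N)] for _ in range(N)] for _ in range(M)]
s_vector = measure(patterns, scene)
```

Because the two binary frames differ exactly by the ±1 pattern, subtracting their bucket values gives the same number as projecting the ±1 pattern directly, which a physical DMD cannot display.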
Photonics 2021, 8, x FOR PEER REVIEW 3 of 17
The multitask learning deep neural network we used is shown in Figure 1b. The input for training the network is the S-vector of length M and its corresponding label. The S-vector passes through two fully connected layers and is then reshaped into a feature map with a size of 32 × 32 pixels. Features are then extracted from this map by the backbone network, which consists of a set of convolutions and two Resblocks. The location and classification tasks are carried out in the backbone network, and a branch from the backbone network is used for the imaging task. The prediction results of the three tasks are combined to calculate the loss function, which is then back-propagated to optimize the parameters. After multiple epochs of training, the final network parameters are determined. The detailed structure of the deep neural network of SP-ILC can be found in [45].
In the testing stage, a picture from the test set is displayed on the SLM as the object, the DMD projects the different patterns onto the object, and the S-vector is obtained through the single-pixel bucket detector. Next, as shown in Figure 1c, the experimentally obtained S-vector is sent to the trained deep neural network, and SP-ILC concurrently images, locates, and classifies the objects from this one-dimensional S-vector.

Loss Function
The loss function L that we created consists of three terms:

$$L = \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3, \tag{2}$$

where $\lambda_t$ (t = 1, 2, and 3) are hyper-parameters that balance the weights of the different tasks. The values of the three products $\lambda_t L_t$ are brought to the same level by adjusting $\lambda_t$, so that the deep neural network pays similar attention to the three tasks. The pre-experiment used to obtain the optimal hyper-parameters is described in Section 2.5. $L_1$ is the loss function of the imaging task, which uses the root mean square error (RMSE):

$$L_1 = \sqrt{\frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left[ G(x, y) - \hat{G}(x, y) \right]^2}, \tag{3}$$

where W and H are the width and height of the picture in pixels, both 64. The original picture label and the picture predicted by the deep neural network are represented by $G$ and $\hat{G}$, respectively. In the remainder of this paper, we use the hat notation to distinguish network prediction results from labels. $L_2$ is the loss function for the object location task:

$$L_2 = \sum_{i=1}^{R} \sum_{j=1}^{B} l_{ij}^{\mathrm{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \sum_{i=1}^{R} \sum_{j=1}^{B} l_{ij}^{\mathrm{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right]. \tag{4}$$

The first summation in Equation (4) quantifies the distance of the object from the center of the prediction box, and the second summation quantifies the differences in width and height between the prediction box and the true location. The variables x and y are the coordinates of the center of the object, and w and h are the width and height of the object measured through the center of the frame; R is the number of grids the picture is divided into; B is the number of objects that are to be predicted in each grid; and $l_{ij}^{\mathrm{obj}}$ is 1 when the object is in grid i and 0 when the object is not in grid i.
$L_3$ is the loss function for the object classification task:

$$L_3 = \sum_{i=1}^{R} \sum_{j=1}^{B} l_{ij}^{\mathrm{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\mathrm{noobj}} \sum_{i=1}^{R} \sum_{j=1}^{B} l_{ij}^{\mathrm{noobj}} \left( C_i - \hat{C}_i \right)^2 + \sum_{i=1}^{R} l_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left[ p_i(c) - \hat{p}_i(c) \right]^2. \tag{5}$$

The first two summations in Equation (5) quantify the deviation of the predicted classification confidence with respect to two types of inaccurate prediction: the prediction box contains an object but there is no such object in that location on the original image; or the prediction box does not contain an object but there is an object in that location on the original image. C = 1 means there is an object, C = 0 means there is no object, $l_{ij}^{\mathrm{noobj}} = 1 - l_{ij}^{\mathrm{obj}}$, and $\lambda_{\mathrm{noobj}}$ is the weight. The third summation quantifies the difference between the predicted classification and the actual classification; $p_i(c)$ is the probability that the object belongs to the class c. The forms of Equations (4) and (5) are based on the YOLO loss function in the target detection field [43,44].
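To make the structure of the three-term loss concrete, the sketch below evaluates it on a toy example. It is a simplified illustration, not the released implementation [45]: the grid and box indices are flattened into one list of cells, the square-root form of the size term and the default `lambda_noobj=0.5` follow the original YOLO loss and are assumptions, and the weights in `lambdas` are placeholders rather than the values selected in the pre-experiment.

```python
import math

def rmse_loss(G, G_hat):
    """L1: root mean square error between the label image G and the prediction G_hat."""
    H, W = len(G), len(G[0])
    sq = sum((G[y][x] - G_hat[y][x]) ** 2 for y in range(H) for x in range(W))
    return math.sqrt(sq / (W * H))

def location_loss(cells):
    """L2: centre and size errors, summed over cells that contain an object."""
    total = 0.0
    for c in cells:
        if c["obj"]:
            (x, y, w, h), (xh, yh, wh, hh) = c["box"], c["box_hat"]
            total += (x - xh) ** 2 + (y - yh) ** 2
            total += (math.sqrt(w) - math.sqrt(wh)) ** 2 + (math.sqrt(h) - math.sqrt(hh)) ** 2
    return total

def classification_loss(cells, lambda_noobj=0.5):
    """L3: confidence error (object / no object) plus class-probability error."""
    total = 0.0
    for c in cells:
        conf_err = (c["C"] - c["C_hat"]) ** 2
        if c["obj"]:
            total += conf_err
            total += sum((p - ph) ** 2 for p, ph in zip(c["p"], c["p_hat"]))
        else:
            total += lambda_noobj * conf_err
    return total

def total_loss(G, G_hat, cells, lambdas=(1.0, 1.0, 1.0)):
    """L = lambda_1 * L1 + lambda_2 * L2 + lambda_3 * L3."""
    l1, l2, l3 = rmse_loss(G, G_hat), location_loss(cells), classification_loss(cells)
    return lambdas[0] * l1 + lambdas[1] * l2 + lambdas[2] * l3

# Toy example: a 2 x 2 image, one cell with an object and one empty cell.
G = [[0, 255], [255, 0]]
G_hat = [[3, 255], [255, 4]]
cells = [
    {"obj": True, "C": 1, "C_hat": 0.8,
     "box": (0.5, 0.5, 0.25, 0.25), "box_hat": (0.5, 0.5, 0.25, 0.25),
     "p": [0, 1], "p_hat": [0, 1]},
    {"obj": False, "C": 0, "C_hat": 0.2},
]
loss = total_loss(G, G_hat, cells)
```

In this toy case the image term dominates the sum, which is the situation the hyper-parameters $\lambda_t$ are meant to correct by bringing the three weighted terms to a similar level.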

Dataset
The MNIST dataset is commonly used in deep learning [48]. However, each picture in the dataset contains only a single numeric character, there are no position coordinates, and all the number images are approximately the same size. Thus, the MNIST dataset was unsuitable for our experiment. We created a new dataset to be able to detect, locate, and classify images of multiple, different-sized, possibly overlapping objects in different positions. We enlarged or shrank the 28 × 28 pixel images of single numeric characters and placed them randomly on a 64 × 64 pixel background. Number images may therefore overlap. The training set we produced contains >180,000 pictures. Each picture, flattened to a 4096 × 1 vector, was multiplied by an M × 4096 random matrix A to obtain the M × 1 S-vector. The 64 × 64 patterns displayed on the DMD are obtained by reshaping each row of matrix A.
Each S-vector is associated with a three-part label consisting of the original image (used to train the imaging task), location coordinates (used to train the location task), and the image category (used to train the classification task). We developed a labeling program to generate these labels. Each picture in the MNIST dataset contains an image of only one number, so we obtained the bounding box of the single number in the original image by scanning the pixels one by one from the outside to the inside. The bounding box of the single number was mapped onto a bounding box in the final picture, according to the ratio in the resize operation and the number's location in the final image. The label of the number (in the original dataset) was also used to construct the ground truth of the final picture.
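The labeling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' labeling program: the resize is nearest-neighbour, overlapping digits are merged with a pixel-wise maximum, the placement is fixed rather than random so that the result is reproducible, and the helper names `bounding_box`, `resize_nearest`, and `place_digit` are hypothetical.

```python
def bounding_box(img):
    """Tight box (x_min, y_min, x_max, y_max), inclusive, around the non-zero pixels."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    return cols[0], rows[0], cols[-1], rows[-1]

def resize_nearest(img, size):
    """Nearest-neighbour resize of a square image to size x size pixels."""
    n = len(img)
    return [[img[y * n // size][x * n // size] for x in range(size)] for y in range(size)]

def place_digit(canvas, digit, size, left, top):
    """Paste a resized digit onto the canvas and return its box in canvas coordinates."""
    scaled = resize_nearest(digit, size)
    for y in range(size):
        for x in range(size):
            # Pixel-wise maximum so that overlapping digits do not erase each other.
            canvas[top + y][left + x] = max(canvas[top + y][left + x], scaled[y][x])
    x0, y0, x1, y1 = bounding_box(digit)
    ratio = size / len(digit)
    # Map the box found in the original digit through the resize ratio and the offset.
    # The right and bottom edges are exclusive (pixel-edge convention).
    return (left + x0 * ratio, top + y0 * ratio,
            left + (x1 + 1) * ratio, top + (y1 + 1) * ratio)

# Stand-in for a 28 x 28 MNIST digit: a filled rectangle.
digit = [[255 if 8 <= x <= 19 and 4 <= y <= 23 else 0 for x in range(28)] for y in range(28)]
canvas = [[0] * 64 for _ in range(64)]
box = place_digit(canvas, digit, size=14, left=10, top=20)
label = {"image": canvas, "boxes": [box], "classes": [7]}  # the class value is illustrative
```

Mapping the box analytically, rather than rescanning the final 64 × 64 picture, keeps each digit's box well defined even when another digit overlaps it.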

Training Parameters
The experiments were all conducted on a GeForce RTX 2060 GPU, using the PyTorch framework [49]. The network parameters were updated using SGD with Adam optimization [50]. We initialized the first fully connected layer of the network as an identity transformation and left the remaining parameters at their default random initialization. We trained the model for 40 epochs, with 810 iterations per epoch. We divided the 40 epochs into two parts, each of which starts with a learning rate of 0.01 and multiplies the learning rate by 0.1 every 8 epochs. The first part contains 12 epochs, in which the first fully connected layer is frozen. The second part contains 28 epochs, in which all layers are unfrozen.
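The two-part schedule can be written down explicitly. The sketch below is one reading of the description above, assuming that the step counter restarts at the beginning of the second part (as suggested by each part setting its own initial learning rate); the function name `training_plan` and its parameters are hypothetical.

```python
def training_plan(epoch, base_lr=0.01, gamma=0.1, step=8, frozen_epochs=12, total_epochs=40):
    """Return (learning_rate, first_fc_frozen) for a 0-based epoch index."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    frozen = epoch < frozen_epochs
    # The decay counter restarts when the second part begins (assumption).
    local = epoch if frozen else epoch - frozen_epochs
    return base_lr * gamma ** (local // step), frozen

plan = [training_plan(e) for e in range(40)]
```

Under this reading, the learning rate is 0.01 for epochs 0 to 7 and 0.001 for epochs 8 to 11 with the first layer frozen, then returns to 0.01 at epoch 12 and decays to about 1e-5 by the final epochs.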

Other Activities
In this section, we describe how we set the hyperparameters and how we avoided overfitting. Before we used the dataset of 180k computer-generated samples for training, we created an 18k dataset for the pre-experiment. In the pre-experiment, we varied the hyperparameters, such as the initial learning rate, the decay rate of the learning rate, and the λ values in the loss function. After several test runs, we selected the values given in this article.
Overfitting is a very significant concern in machine learning. We paid great attention to it during training and adopted an effective method to avoid it. We split the computer-generated images into a training set and a validation set at a ratio of 9:1 and tracked the loss of each task on both sets during training. If we observed that the training loss decreased over multiple consecutive epochs while the validation loss increased, we terminated training and kept the weights recorded before overfitting occurred.
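The stopping rule can be expressed as a short check on the two loss histories. This is a hedged sketch: the text says only "multiple consecutive epochs", so the default `patience=3` is an assumption, and the function name `overfitting_epoch` is hypothetical.

```python
def overfitting_epoch(train_loss, val_loss, patience=3):
    """Return the first epoch of a run of `patience` consecutive epochs in which the
    training loss fell while the validation loss rose, or None if there is no such run."""
    run = 0
    for e in range(1, len(train_loss)):
        if train_loss[e] < train_loss[e - 1] and val_loss[e] > val_loss[e - 1]:
            run += 1
            if run == patience:
                return e - patience + 1
        else:
            run = 0
    return None

train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2]
val = [1.1, 0.9, 0.7, 0.75, 0.8, 0.85, 0.9]
start = overfitting_epoch(train, val)  # the weights to keep are those of epoch start - 1
```

In practice this requires saving a checkpoint after every epoch, so that the weights from just before the diverging run can be restored.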
Figure 2 shows a typical training process, in which the loss decreases; each epoch takes about 20 min.

Concurrent Imaging, Location, and Classification
The left column of Figure 3 shows the original handwritten objects, which are the four single-digit numbers 9, 7, 3, and 0. The right column of Figure 3 shows the results given by SP-ILC from the input S-vector (M = 333). It can be seen that for single objects with different sizes and locations, SP-ILC concurrently reconstructs the image, finds its location, and classifies it. SP-ILC performs single-pixel imaging, marks the position of the object with a rectangular frame, and appends the classification and the classification confidence to the image. The left column of Figure 4 shows scenes containing several handwritten numbers of different sizes. The right column of Figure 4 shows the results obtained by SP-ILC from the input S-vector (M = 333). It can be seen that for multiple objects of different sizes and positions, SP-ILC concurrently performs high-definition imaging, high-precision object location, and highly accurate classification. SP-ILC also performs well on imaging, object location, and classification for overlapping objects (4 and 2 in Figure 4d, 2 and 5 in Figure 4h).
To quantify the performance of SP-ILC, we conducted a quantitative study on the test set, as shown in Table 1. The test set (referred to as Testset-80 in the following) included 40 single-object samples and 40 multiple-object samples. Table 1 shows that the performance of SP-ILC on single-object samples is better than that on multiple-object samples, since the single-object task is easier to process. The peak signal-to-noise ratio (PSNR) and the structural similarity function (SSIM) are widely used to measure image quality [51,52]. The PSNR is defined by:

$$\mathrm{PSNR} = 20 \log_{10} \left( \frac{\mathrm{MAX}_G}{\mathrm{RMSE}} \right), \tag{6}$$

where the RMSE is the image loss $L_1$ defined in Equation (3) and $\mathrm{MAX}_G$, which is the maximum value of the image, is 255 in this paper. The SSIM is defined by:

$$\mathrm{SSIM} = \frac{\left( 2 u_G u_{\hat{G}} + \alpha_1 \right) \left( 2 \sigma_{G \hat{G}} + \alpha_2 \right)}{\left( u_G^2 + u_{\hat{G}}^2 + \alpha_1 \right) \left( \sigma_G^2 + \sigma_{\hat{G}}^2 + \alpha_2 \right)}, \tag{7}$$

where $u$ and $\sigma^2$ are the mean values and variances of the images, respectively, $\sigma_{G \hat{G}}$ is the cross-correlation coefficient between the original images (ground truth/labels) of objects and the images retrieved by SP-ILC, and $\alpha_1$ and $\alpha_2$ are two positive constants that avoid a null denominator.
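Both image-quality metrics are straightforward to compute from the definitions. The following sketch evaluates them over the whole image in a single window; many SSIM implementations instead average over sliding local windows, so values may differ from those of standard toolboxes. The default constants `alpha1` and `alpha2` follow the common convention (0.01 · 255)² and (0.03 · 255)², which is an assumption, since the paper states only that they are positive.

```python
import math

def rmse(G, G_hat):
    """Root mean square error between two equally sized images."""
    H, W = len(G), len(G[0])
    sq = sum((G[y][x] - G_hat[y][x]) ** 2 for y in range(H) for x in range(W))
    return math.sqrt(sq / (W * H))

def psnr(G, G_hat, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    e = rmse(G, G_hat)
    return math.inf if e == 0 else 20 * math.log10(max_value / e)

def ssim_global(G, G_hat, alpha1=(0.01 * 255) ** 2, alpha2=(0.03 * 255) ** 2):
    """Single-window SSIM computed from global means, variances and covariance."""
    a = [v for row in G for v in row]
    b = [v for row in G_hat for v in row]
    n = len(a)
    ua, ub = sum(a) / n, sum(b) / n
    va = sum((v - ua) ** 2 for v in a) / n
    vb = sum((v - ub) ** 2 for v in b) / n
    cov = sum((p - ua) * (q - ub) for p, q in zip(a, b)) / n
    return ((2 * ua * ub + alpha1) * (2 * cov + alpha2)) / (
        (ua ** 2 + ub ** 2 + alpha1) * (va + vb + alpha2))

G = [[0, 255], [255, 0]]
G_hat = [[3, 255], [255, 4]]
quality = (psnr(G, G_hat), ssim_global(G, G_hat))
```

For this toy pair the RMSE is 2.5, so the PSNR is 20 log10(255 / 2.5), a little over 40 dB, and the SSIM is slightly below 1.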
We use precision and recall to quantify the accuracy of location and classification. Precision is the percentage of correct predictions among all predictions, and recall is the percentage of correct predictions among all labels [53,54].
The calculation of both precision and recall is based on IOU, the ratio of the intersection to the union of the labeled position (the original bounding Box A) and the predicted position (the predicted bounding Box B):

$$\mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|}. \tag{8}$$

We counted only frames with IOU > 0.5; that is, a prediction was considered correct in both classification and location only when IOU > 0.5 and the predicted classification was correct.
The confidence level is 0.6, which means that only predicted bounding boxes with a confidence larger than 0.6 are retained.
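The evaluation criterion above can be sketched in a few lines. This is an illustration under assumptions rather than the evaluation script used for the tables: boxes are given as (x_min, y_min, x_max, y_max), each label can be matched by at most one prediction (so a duplicate box for the same object counts as incorrect), and the names `iou` and `precision_recall` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(predictions, labels, conf_threshold=0.6, iou_threshold=0.5):
    """Precision and recall after confidence filtering and one-to-one matching."""
    kept = [p for p in predictions if p["conf"] > conf_threshold]
    matched = set()
    correct = 0
    # Most confident predictions claim labels first (greedy matching, an assumption).
    for p in sorted(kept, key=lambda q: -q["conf"]):
        for i, g in enumerate(labels):
            if (i not in matched and p["cls"] == g["cls"]
                    and iou(p["box"], g["box"]) > iou_threshold):
                matched.add(i)
                correct += 1
                break
    precision = correct / len(kept) if kept else 0.0
    recall = correct / len(labels) if labels else 0.0
    return precision, recall

labels = [{"box": (10, 10, 30, 30), "cls": 4}]
predictions = [
    {"box": (11, 11, 30, 30), "cls": 4, "conf": 0.9},  # correct
    {"box": (12, 10, 32, 30), "cls": 4, "conf": 0.7},  # duplicate of the same object
    {"box": (40, 40, 50, 50), "cls": 1, "conf": 0.3},  # removed by the confidence filter
]
result = precision_recall(predictions, labels)
```

Here the duplicate box lowers precision to 0.5 while recall stays at 1.0, the same mechanism by which extra bounding boxes reduce precision without affecting recall.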

Precision-Recall Curve
We examined the precision-recall (PR) curves in terms of the accuracy of object location and the accuracy of classification. The PR curve in Figure 5 shows how precision and recall for the object location and classification tasks change in response to changes in the confidence level during testing; it is a good measure of network performance. In general, when the confidence level is increased, precision increases and recall decreases. If the confidence level is decreased, precision decreases but recall increases. The confidence level can be varied independently for the different tasks in order to reach the best operational state of the system. For example, for disease detection, recall is more important than precision; thus, recall needs to be very high, and there is some tolerance for low precision. We can decrease the confidence level to achieve the optimal solution for this type of task. The confidence level can be similarly adjusted for other tasks to achieve a balance between precision and recall.
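The trade-off described above can be reproduced by sweeping the confidence level over a list of scored predictions. The sketch below is a simplified illustration with made-up scores: it assumes that each prediction already carries a fixed correct/incorrect flag (from the IOU and class check), and it returns a precision of 1.0 by convention when no prediction survives the threshold. The name `pr_curve` is hypothetical.

```python
def pr_curve(scored, n_labels, thresholds):
    """scored: list of (confidence, is_correct). Returns [(threshold, precision, recall)]."""
    curve = []
    for t in thresholds:
        kept = [ok for conf, ok in scored if conf > t]
        tp = sum(kept)  # True counts as 1
        precision = tp / len(kept) if kept else 1.0
        recall = tp / n_labels
        curve.append((t, precision, recall))
    return curve

# Illustrative data only: 8 predictions for a test set with 5 labelled objects.
scored = [(0.95, True), (0.9, True), (0.8, True), (0.7, False),
          (0.65, True), (0.4, False), (0.3, True), (0.2, False)]
curve = pr_curve(scored, n_labels=5, thresholds=[0.1, 0.6, 0.85])
```

Raising the threshold from 0.1 to 0.85 moves this toy system from high recall with modest precision to perfect precision with low recall; a recall-critical task such as disease detection would sit at the low-threshold end.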

Generalization Ability
To test the generalization ability of SP-ILC, we applied the trained SP-ILC to the double MNIST and triple MNIST datasets. A total of 20 examples of double MNIST and 20 examples of triple MNIST were randomly chosen. As shown in Figure 6 and Table 2, SP-ILC works well on the double MNIST and triple MNIST datasets, although none of the training data of these two datasets was used in the training of SP-ILC. Comparing Table 1 with Table 2, the multiple-object samples of Testset-80 are more difficult than those of double and triple MNIST, because the objects in double and triple MNIST are similar in size and do not overlap. The Testset-80, double MNIST, and triple MNIST datasets contain only images of digits because the purpose of this paper is to demonstrate the feasibility of concurrent single-pixel imaging, object location, and object classification using deep learning.

Optimal Patterns
The results of Sections 3.1-3.3 were obtained using random patterns. In this section, we numerically study how the performance of SP-ILC improves when optimal patterns are used.
As shown in Figure 7 and Table 3, the performance of both the ordered Hadamard patterns and the trained patterns is generally better than that of the random patterns on the different datasets. The Hadamard patterns are ordered by the number of connected regions [55]. These improvements are consistent with previous studies [35,37]. This simulation study shows that optimal patterns, especially trained or 'learned' patterns, can be used to further improve the performance of SP-ILC.
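The ordering of Hadamard patterns by connected regions can be illustrated on a small example. The sketch below assumes the Sylvester construction of the Hadamard matrix, 4-connectivity when counting regions, and ascending order (fewest regions first); these choices, the toy 4 × 4 pattern size, and the function names are assumptions for illustration and may differ from the procedure in [55].

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix; n must be a power of two."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def connected_regions(img):
    """Count 4-connected regions of equal value in a 2-D pattern (flood fill)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if seen[sy][sx]:
                continue
            count += 1
            seen[sy][sx] = True
            stack = [(sy, sx)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and img[ny][nx] == img[y][x]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count

def ordered_patterns(side):
    """Reshape each Hadamard row to side x side and sort by number of connected regions."""
    rows = hadamard(side * side)
    patterns = [[row[r * side:(r + 1) * side] for r in range(side)] for row in rows]
    return sorted(patterns, key=connected_regions)

patterns = ordered_patterns(4)  # 16 patterns of 4 x 4 (the experiment would use 64 x 64)
counts = [connected_regions(p) for p in patterns]
```

Keeping only the first M patterns of such an ordering favours coarse, low-spatial-frequency patterns, which is one plausible reason why ordered patterns outperform random ones at a low sampling rate.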

Fashion MNIST
Although the above studies focus on MNIST, which consists only of images of digits, the proposed method may also work well in parsing scenes that contain images of real-world objects, as long as a training dataset containing images of the actual objects is provided. To demonstrate this, we numerically studied the performance of SP-ILC on the Fashion MNIST dataset. Fashion MNIST is a dataset that contains 70 K grayscale images (each pixel value is an integer between 0 and 255), associated with labels from 10 classes [56].
Using the method described in Section 2.3, we prepared a dataset with 40 K samples to train SP-ILC. The hyper-parameters and parameters are the same as those used in the MNIST experiments. In particular, the pattern number is also M = 333. A total of 100 test samples with a single object and 100 test samples with multiple objects were used to test the performance of SP-ILC. The test results are shown in Figure 8 and Table 4. The precision and recall are calculated at the confidence level of 0.6. It can be seen that SP-ILC works on the dataset based on Fashion MNIST, whose grayscale images are closer to real-world images than those of MNIST.

Analysis of Imaging Ability
I. Hoshi et al. provided a detailed study of the performance of the convolutional neural network (CNN)-based method and the recurrent neural network (RNN)-based method on the MNIST and Fashion MNIST datasets using numerical experiments [57]. The number of patterns is also 333 and the size of the object is 64 × 64 in their experiments. In Section 3.5 of this paper, the random pattern takes the values 0 and 1, and the optimal pattern takes the values 1 and −1. This setting is the same as in [57]. We compared both the CNN-based method [35] and the RNN-based method [57] with SP-ILC on the original Fashion MNIST.
The quantitative evaluation of the CNN-based and RNN-based methods is calculated according to Table 2 of [57]. The quantitative performance of SP-ILC is calculated by testing 100 randomly chosen samples from the standard Fashion MNIST test set. The only operation applied to these 100 samples is resizing them from 28 × 28 to 64 × 64 for a fair comparison.
Table 5 shows that the performance of SP-ILC exceeds that of both the CNN-based method and the RNN-based method.

Analysis of Classification Ability
SP-ILC is the first work that can locate and classify multiple objects of different sizes, including overlapping objects, using a single-pixel detector, although previous studies have achieved single-pixel classification for scenes with a single object, even when the object is moving at high speed [34,37]. The metric used in these works for single-object classification is accuracy. However, for object location or object identification, it is hard to calculate the accuracy [58], since the calculation of accuracy relies on the number of true negatives (TN), and for any given image the number of TN is infinite because there is an infinite number of bounding boxes that should not be detected [59].
Therefore, we designed the test set with 40 differently sized MNIST test samples that contain only a single object to provide some degree of comparison (four examples are shown in Figure 3). On this single-object test set, at a sampling rate of 8.1% (the sampling rate is calculated as M/N = 333/4096), SP-ILC predicted 43 bounding boxes for the 40 single-target pictures, of which 40 were correct. The precision was therefore 0.930 (40/43). A total of 43 bounding boxes means that SP-ILC may give more than one prediction for one object; for example, in Figure 4h, the number '4' has been given two predictions, which decreases the precision of the system. The recall was 1.000 (40/40). This classification performance matches that of current state-of-the-art classification methods [34,37]. We note that in previous studies of single-pixel classification there is prior knowledge that a scene contains only one object. SP-ILC does not have such prior knowledge, so even for a single-object scene, our task is more difficult than that of current classification algorithms. Moreover, other studies have used similar-sized objects, whereas the object size in our test set varies.

The Test Time for One Image
The measurement number is M = 333 (displayed by 666 DMD patterns), the pattern rate of the DMD is 20 kHz, and the test time of SP-ILC for one frame is about 28 ms (using a single GeForce RTX 2060 GPU). Therefore, SP-ILC has the potential to achieve real-time concurrent imaging, object location, and classification.
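The figures quoted above imply a simple timing budget, worked out in the sketch below. The two frame-rate estimates are illustrative assumptions, not measured results: one assumes acquisition and inference run one after the other, the other assumes they are pipelined so that the slower stage sets the rate. Data transfer and DMD loading overheads are ignored.

```python
M = 333                  # number of +/-1 patterns
frames = 2 * M           # each pattern is displayed as two DMD frames
dmd_rate_hz = 20_000     # DMD pattern rate
inference_s = 0.028      # reported network test time per frame

acquisition_s = frames / dmd_rate_hz                  # time to display all 666 frames
sequential_fps = 1.0 / (acquisition_s + inference_s)  # acquire, then infer
pipelined_fps = 1.0 / max(acquisition_s, inference_s) # acquire the next scene while inferring
```

Displaying the 666 frames takes about 33 ms, comparable to the 28 ms inference time, which corresponds to roughly 16 frames per second when the two stages run sequentially and about 30 frames per second if they overlap.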

The End-to-End Multitask Learning System
SP-ILC is an end-to-end multitask learning system. In comparison with sequential pipelines (which perform SPI first and then send the image to an object identification system), SP-ILC offers many benefits: (1) the multitask learning deep neural network is structurally more compact than sequential pipelines and dispenses with the training of multiple models; (2) the multitask learning technique can share the shallow layers of the deep neural network between different tasks, which promotes efficient learning; and (3) multitask learning increases generalizability, which is very important for deep learning models, because each task has components unrelated to the other tasks that can be ignored as noise by those tasks, thus making the model more robust [60,61].
In summary, this paper proposes SP-ILC to perform concurrent imaging, object location, and classification using deep learning. Through feature sharing, multitask loss

Figure 1. SP-ILC schematic. (a) The 532 nm laser is modulated by patterns displayed by the digital micromirror device (DMD). The object is displayed on the spatial light modulator (SLM) and the reflected light is collected by the single-pixel detector (PD) to create the bucket signal S. P is a linear polarizer. PBS is a polarization beam splitter. A/DC is an analog-to-digital converter. PC is a personal computer with a GeForce RTX 2060 graphics processing unit (GPU). (b) The deep neural network concurrently performs imaging, classification, and location tasks using the single S-vector. The visualization is drawn using PlotNeuralNet software [47]. The slanted numbers are the sizes of the feature maps and the horizontal numbers are the channel numbers of the feature maps. (c) The input and output of SP-ILC with the trained deep neural network.

Figure 2. A typical training process with loss decrease.

Figure 3. Examples of single-pixel imaging, object location, and classification for a single object with different sizes and positions. The left column shows original images of objects displayed on the SLM; the right column shows the images retrieved by SP-ILC. Concurrently, the location and classification of objects are also output by SP-ILC. The locating result (predicted bounding box) of one object is marked by a rectangle; the two numbers above each rectangle (bounding box) are the classifying result and the corresponding classifying confidence, respectively.

The left column of Figure 4 shows objects that are various handwritten numbers of different sizes. The right column of Figure 4 shows the results obtained by SP-ILC from the input S-vector (M = 333). It can be seen that for multiple objects of different sizes and positions, SP-ILC concurrently performs high-definition imaging, high-precision object location, and highly accurate classification. SP-ILC performs well on imaging, object location, and classification for overlapping objects (4 and 2 in Figure 4d, 2 and 5 in Figure 4h).
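
One common way to quantify how well a predicted bounding box matches the true one is intersection over union (IoU). This excerpt does not state which overlap measure was used, so the sketch below is an assumption for illustration; the `(x_min, y_min, x_max, y_max)` box format and the function name `iou` are likewise hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2 x 2 boxes offset by one pixel in each direction share a 1 x 1 region.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A detection is then typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold and the predicted class matches.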

Figure 4. The left column (a,c,e,g) shows original images of objects displayed on the SLM; the right column (b,d,f,h) shows the results of single-pixel imaging, classification, and location for scenes containing multiple objects with different sizes and locations, some of which overlap.

Figure 6. The performance of SP-ILC for (a) double MNIST and (b) triple MNIST. The first and third rows show the original images of objects; the second and fourth rows show the results of concurrent imaging, object location, and object classification.

Figure 7. The precision-recall curves of SP-ILC tested on (a) Testset-80 and (b) the double and triple MNIST datasets for different kinds of patterns.
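
For readers unfamiliar with how such curves are built, the sketch below traces a precision-recall curve from a list of scored detections. It is a generic construction, not the evaluation code behind Figure 7; the input format (confidence paired with a true-positive flag) and the function name are assumptions, and deciding whether a detection is a true positive is left outside the function.

```python
def precision_recall_curve(detections, n_ground_truth):
    """Return (recall, precision) points as the confidence threshold is lowered.

    detections: list of (confidence, is_true_positive) pairs.
    n_ground_truth: total number of ground-truth objects (must be > 0).
    """
    # Visit detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    false_pos = 0
    curve = []
    for _, is_true_positive in ranked:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        recall = true_pos / n_ground_truth
        precision = true_pos / (true_pos + false_pos)
        curve.append((recall, precision))
    return curve

# Three detections against two ground-truth objects.
points = precision_recall_curve(
    [(0.9, True), (0.8, False), (0.7, True)], n_ground_truth=2
)
```

Each point corresponds to accepting all detections at or above one confidence value, so lowering the threshold moves along the curve toward higher recall.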

Figure 8. The performance of SP-ILC, based on trained patterns, for Fashion MNIST. The first row shows the original images of objects; the second row shows the results of concurrent imaging, object location, and object classification.

Table 1. The quantitative performance of SP-ILC for Testset-80.

Table 2. The quantitative performance of SP-ILC for double MNIST and triple MNIST.

double MNIST and triple MNIST datasets. A total of 20 examples of double MNIST and 20 examples of triple MNIST are randomly chosen. As shown in Figure

Table 3. The quantitative performance of SP-ILC for different datasets using optimal patterns.

Table 4. The quantitative performance of SP-ILC for Fashion MNIST.

Table 5. Quantitative performance comparison of SP-ILC with CNN- and RNN-based methods for Fashion MNIST.