Continuous Viewpoint Planning in Conjunction with Dynamic Exploration for Active Object Recognition

Active object recognition (AOR) aims at collecting additional information to improve recognition performance by purposefully adjusting the viewpoint of an agent. How to determine the next best viewpoint of the agent, i.e., viewpoint planning (VP), is a research focus. Most existing VP methods explore a discrete viewpoint space, which requires sampling the viewpoint space and may introduce significant quantization error. To address this challenge, a continuous VP approach for AOR based on reinforcement learning is proposed. Specifically, we use two separate neural networks to model the VP policy as a parameterized Gaussian distribution and resort to the proximal policy optimization framework to learn the policy. Furthermore, an adaptive entropy regularization based dynamic exploration scheme is presented to automatically adjust the viewpoint exploration ability in the learning process. Finally, experimental results on the public dataset GERMS demonstrate the superiority of our proposed VP method.


Introduction
Visual object recognition plays an important role in the fields of computer vision and robotics. It has been successfully applied to a large number of tasks, e.g., autonomous driving, manipulation and grasping, security monitoring, transportation surveillance [1], etc.
Most recognition systems focus exclusively on static image recognition, that is, the systems take a single snapshot as input and generate a category label estimate as output [2]. Recognition errors arise easily when the single-view image cannot provide enough information. In contrast, human vision is exploratory, probing, and searching in order to better understand the surroundings. For example, when you cannot identify a person from behind, you walk around to see the person's face. Thus, if the viewpoint of an agent (e.g., an autonomous mobile robot with a head-mounted camera) is allowed to change, more detailed information can be collected to improve recognition performance.
The idea described above fits into the realm of active object recognition (AOR) [3][4][5], which gathers additional evidence to improve recognition performance by purposefully adjusting the viewpoint (position and orientation) of an agent. Many classic and recent AOR approaches are reviewed in [6,7]. The main focus of AOR research is viewpoint planning (VP), i.e., how to determine the next best viewpoint of the agent. A good VP policy can greatly improve the recognition performance. In recent years, reinforcement learning has attracted growing research attention for viewpoint planning [8][9][10][11][12]. The agent is able to learn a good VP policy under the guidance of hand-designed reward functions. The main algorithms involved in the learning process are dynamic programming [8] and Q-Learning [9][10][11][12]. Both dynamic programming based and Q-Learning based methods have made a great contribution to AOR. However, these VP methods explore a discrete viewpoint space, which requires sampling the viewpoint space and may introduce significant quantization error.
To alleviate this problem, we propose a continuous viewpoint planning approach for AOR based on reinforcement learning in this work. The approach can effectively explore the continuous viewpoint space. To be specific, we employ the recently presented proximal policy optimization (PPO) [13] framework to tackle the VP problem. The VP policy is represented by a Gaussian model that can be monotonically improved by the clipping mechanism of PPO. In addition, the standard deviation of the Gaussian model determines the viewpoint exploration ability, i.e., the opportunity to try new viewpoints. As shown in Figure 1, the larger the standard deviation is, the stronger the exploration ability is. If the standard deviation is fixed during the whole policy learning process (fixed exploration), two undesirable results follow: (1) the VP policy may get stuck in a local optimum due to insufficient exploration when the standard deviation is small; (2) the optimal VP policy cannot be obtained when the standard deviation is large (because the optimal VP policy is a deterministic policy, which is approximately equivalent to a Gaussian model with a small standard deviation). Hence, in reinforcement learning, it is generally desirable to have higher exploration in the early stage of policy learning and to reduce it gradually in the later stage in order to obtain a better policy [14]. Therefore, we develop a dynamic exploration scheme to automatically adjust viewpoint exploration in the learning process. The scheme is implemented by using separate neural networks to represent the policy mean and standard deviation and training them at the same time. Moreover, entropy regularization [15] is introduced and improved to an adaptive version to prevent the exploration from shrinking prematurely. The experimental results on the public dataset GERMS [12] support the effectiveness of our proposed VP method.
The contributions of our work are as follows:
• A novel continuous viewpoint planning method for active object recognition based on proximal policy optimization is proposed to deal with the quantization error of discrete viewpoint planning methods;
• An adaptive entropy regularization based dynamic exploration scheme is presented to automatically adjust viewpoint exploration in the learning process;
• Experiments are carried out on the public dataset GERMS, and the proposed method obtains promising results.
The remainder of the paper is laid out as follows. Section 2 reviews the related research. Section 3 formulates the problem. Section 4 details our continuous viewpoint planning method. Section 5 presents the experimental results and analysis, and Section 6 draws conclusions.

Related Work
This section reviews related work about active object recognition and proximal policy optimization.
Active Object Recognition: Becerra et al. [8] model object detection as a Partially Observable Markov Decision Process problem, which is solved using Stochastic Dynamic Programming. In [9], researchers formally define viewpoint selection as an optimization problem and use reinforcement learning for viewpoint training without user interaction. Malmir et al. [12] contribute a public image-based AOR dataset named GERMS and propose a deep Q-learning (DQL) system that learns to actively examine objects by minimizing overall classification error using standard back-propagation and Q-learning. Similarly, Liu et al. develop a hierarchical local-receptive-field-based extreme learning machine architecture to learn the state representation and utilize Q-learning to find the optimal policy [10]. In [11], researchers treat AOR as a Partially Observable Markov Decision Process and find the corresponding action-values of the training data using belief tree search. All of the above methods explore a discrete viewpoint space and may therefore miss important object information owing to the quantization error of the viewpoint. Therefore, we develop a continuous VP method for AOR to address this problem. The closest method to ours in this respect is [16], which resorts to the trust region policy optimization (TRPO) framework [17] to tackle the quantization error problem and shows better results on the GERMS dataset than the Q-Learning methods. However, in the TRPO-based AOR method, a linear approximation of the optimization objective and a quadratic approximation of the constraint are used to jointly direct the policy update, leading to relatively high computational complexity. Although the researchers employ an extreme learning machine [18] to alleviate this problem, the learning speed is still unsatisfactory. Different from [16], we adopt the first-order optimization framework PPO [13] for continuous VP learning. It is computationally efficient and is able to guarantee monotonic performance improvement of the VP policy. In addition, the VP policy standard deviation in [16] is fixed and small, which makes the viewpoint exploration insufficient during the learning process and can leave the policy stuck in a local optimum. In contrast, we develop a dynamic exploration scheme to automatically adjust the standard deviation in the learning process in order to obtain a better policy.
Proximal Policy Optimization: PPO has achieved significant success in numerous applications. Gangapurwala et al. [19] introduce a guided constrained policy optimization framework based on PPO, which keeps the behavior of a real quadruped robot within the required safety constraints during training. A centralized coordination scheme for automated vehicles at an intersection without traffic lights using PPO is proposed to address the low computational efficiency of state-of-the-art methods [20]. In [21], researchers apply PPO to the task of image captioning to further improve the reinforcement learning training phase. In [22], researchers propose an integrated metro service scheduling and train unit deployment method with a PPO approach based on the deep reinforcement learning framework. A variant of the PPO algorithm called memory proximal policy optimization is presented to solve quantum control tasks [23]. In [24], a PPO-based machine learning algorithm is implemented to decide on the replenishments of a group of collaborating companies. However, to the best of our knowledge, PPO has never been applied to the AOR task. In our work, it is used for the first time in AOR to learn a continuous VP policy.

Problem Statement
In a visual AOR system, an agent is automatically moved to capture images from different viewpoints to recognize an object. The current viewpoint is known to the agent in the recognition system. Specifically, at the initial time $t = 0$, the viewpoint of the agent is $\varphi_0$ and the captured image is $I_{\varphi_0}$. According to $I_{\varphi_0}$, we can predict the label of the object to be recognized using a classifier. Because a single-viewpoint image is often insufficient to give a robust recognition result, we move the agent to capture more images and improve the recognition performance. This requires us to plan a relative movement action $a_t$ (i.e., VP) for the agent to obtain a new viewpoint $\varphi_{t+1} = \varphi_t + a_t$. Then, the new image $I_{\varphi_{t+1}}$ captured at the viewpoint $\varphi_{t+1}$ is used for recognition again. This process is repeated until a stop criterion is reached, such as a maximum of $T$ steps.
An arbitrary action may lead to a worse view where the captured image does not provide useful information for recognition. Therefore, an effective VP policy is desirable. To this end, we consider the VP problem as a reinforcement learning problem, which is formulated as a six-element tuple $\langle S, A, r, P, \gamma, \pi \rangle$. $S$ denotes the state space, where every element $s$ is generated by the images acquired from different viewpoints of an agent. $A$ is the continuous action space, where every action $a$ is used to move the agent to a new viewpoint. $r : S \times A \to \mathbb{R}$ is a reward function designed to assess the value of one action in a certain state. $P : S \times A \times S \to [0, 1]$ is the transition probability to the next state when an action is selected in the current state. $\gamma \in [0, 1]$ is a discount factor that represents the difference in importance between future rewards and present rewards. $\pi : S \times A \to [0, 1]$ is a continuous VP policy that describes the probability of selecting one action to produce a new viewpoint in a certain state. In the reinforcement learning setting, the VP problem is transformed into finding the optimal policy $\pi^*$, which can move the agent to the best recognition viewpoints.

Proposed Method
To obtain the optimal continuous VP policy $\pi^*$ for AOR, we employ the PPO framework [13] to tackle this problem. Figure 2 shows our AOR pipeline based on PPO.
Figure 2. The proposed AOR pipeline. The pipeline adopts the PPO framework [13] to learn the continuous VP policy $\pi_\theta$, which is denoted by a parameterized Gaussian model. In order to realize dynamic exploration, two separate neural networks are used to represent the policy mean and standard deviation of the Gaussian model and are trained concurrently. During the training process, the policy $\pi_\theta$ is improved by collecting sample trajectories $\{s_t, a_t, r(s_t, a_t)\}_{t=0}^{T}$ and optimizing the PPO objective.
During the policy training process, at each time step $t$, an agent observes the state $s_t \in S$, takes an action $a_t \in A$ under the current VP policy $\pi$ (i.e., $a_t \sim \pi(a_t|s_t)$), generates a new state $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$, and receives a scalar reward $r(s_t, a_t)$. Starting from an arbitrary initial state $s_0$ at time $t = 0$, the cumulative discounted reward function is
$$\eta(\pi) = E\left[\sum_{t=0}^{T} \gamma^{t} r(s_t, a_t)\right], \tag{1}$$
where $E[\cdot]$ denotes the expectation operator and $T$ is the maximum number of planning steps. $\eta(\pi)$ is used to evaluate different VP policies: a better VP policy corresponds to a higher value of $\eta(\pi)$. We assume that the VP policy $\pi$ is parameterized by $\theta$ and denote it as $\pi_\theta$. Thus, finding the optimal continuous VP policy $\pi^*$ amounts to finding the optimal parameter $\theta^*$, which can be solved by
$$\theta^* = \arg\max_{\theta} \eta(\pi_\theta). \tag{2}$$
The recent PPO framework [13] is adopted to address the optimization problem (2) in an iterative updating way. Let $\pi_{\theta_{old}}$ be the old policy, $\pi_\theta$ be the new policy after the policy update, and $\kappa(\theta)$ be the probability ratio $\kappa(\theta) = \pi_\theta(a_t|s_t)/\pi_{\theta_{old}}(a_t|s_t)$. In the PPO framework, $\theta^*$ in (2) can be achieved by maximizing a clipped surrogate objective (the detailed derivation from (2) to (3) can be found in [13,17]):
$$\max_{\theta} L^{\mathrm{CLIP}}(\theta) = E_{\pi_{\theta_{old}}}\left[\min\left(\kappa(\theta) A_{\pi_{\theta_{old}}}(s_t, a_t),\ \mathrm{clip}(\kappa(\theta), 1-\epsilon, 1+\epsilon) A_{\pi_{\theta_{old}}}(s_t, a_t)\right)\right], \tag{3}$$
where $\epsilon$ is a hyper-parameter that controls the clipping ratio. $A_{\pi_{\theta_{old}}}(s_t, a_t)$ is the advantage function under the old policy $\pi_{\theta_{old}}$, which is detailed in Section 4.4. In the following, we elaborate the representation of the state $s_t$, the continuous VP policy $\pi_\theta$, and the reward function $r(s_t, a_t)$ in our PPO-based AOR pipeline and develop a training algorithm to solve the optimization problem in (3).
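The clipped surrogate in (3) can be illustrated with a short Python sketch. It assumes the probability ratios $\kappa(\theta)$ and the advantage estimates are already available per sample; the function name `clipped_surrogate` and the plain batch average are illustrative choices and are not taken from the authors' implementation.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch average of the PPO clipped surrogate term.

    ratios     -- probability ratios kappa(theta), one per sample
    advantages -- advantage estimates under the old policy, one per sample
    eps        -- clipping parameter (0.2 is the value reported in the paper)
    """
    if len(ratios) != len(advantages) or not ratios:
        raise ValueError("ratios and advantages must be non-empty and equal in length")
    total = 0.0
    for kappa, adv in zip(ratios, advantages):
        # clip(kappa, 1 - eps, 1 + eps)
        clipped = min(max(kappa, 1.0 - eps), 1.0 + eps)
        # The minimum makes the objective a pessimistic bound on the update.
        total += min(kappa * adv, clipped * adv)
    return total / len(ratios)
```

With a positive advantage, a ratio above $1+\epsilon$ earns no extra objective value, so the update has no incentive to move the new policy far from the old one.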

Belief Fusion for State Representation
As shown in Figure 2, the captured image $I_{\varphi_t}$ is first transformed into a series of convolutional neural network (CNN) features. We then add a softmax layer on top of the CNN model to identify the objects of interest. The output of the softmax layer is a vector that represents the recognition belief over different objects. We denote the $o$th element of the belief vector as $P(o|I_{\varphi_t})$, where $o = 1, 2, \ldots, M$ is the object label. Like [25], the belief $P(o|I_{\varphi_t})$ is fused with the accumulated belief $P(o|I_{\varphi_0}, I_{\varphi_1}, \ldots, I_{\varphi_{t-1}})$ from previous images using Naive Bayes:
$$P(o|I_{\varphi_0}, I_{\varphi_1}, \ldots, I_{\varphi_t}) = \beta_t P(o|I_{\varphi_t}) P(o|I_{\varphi_0}, I_{\varphi_1}, \ldots, I_{\varphi_{t-1}}). \tag{4}$$
The fusion result $P(o|I_{\varphi_0}, I_{\varphi_1}, \ldots, I_{\varphi_t})$ is the new accumulated belief at time step $t$. $\beta_t$ is a normalizing coefficient ($\beta_t = 1/\sum_{o} P(o|I_{\varphi_t}) P(o|I_{\varphi_0}, I_{\varphi_1}, \ldots, I_{\varphi_{t-1}})$) that makes $\sum_{o} P(o|I_{\varphi_0}, I_{\varphi_1}, \ldots, I_{\varphi_t}) = 1$ hold. In this work, the accumulated belief is used to represent the recognition state (i.e., $s_t = P(o|I_{\varphi_0}, I_{\varphi_1}, \ldots, I_{\varphi_t})$, $o = 1, 2, \ldots, M$) at each time step. It is worth noting that the parameters of the classifier (composed of the CNN model and the softmax layer) are pre-trained with images from different viewpoints of different objects and remain fixed while the continuous VP policy is trained.
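The Naive Bayes fusion step can be sketched in a few lines of Python. The sketch assumes beliefs are plain lists of probabilities over the $M$ object labels; the name `fuse_belief` is hypothetical and only illustrates the product-and-normalize rule described above.

```python
def fuse_belief(accumulated, current):
    """Fuse the accumulated belief with the belief from the newest image.

    Both arguments are probability vectors over the same M object labels.
    The result is the element-wise product, rescaled by the normalizing
    coefficient beta_t so that it sums to one.
    """
    if len(accumulated) != len(current):
        raise ValueError("belief vectors must have the same length")
    products = [p_acc * p_cur for p_acc, p_cur in zip(accumulated, current)]
    norm = sum(products)  # equals 1 / beta_t
    if norm == 0.0:
        raise ValueError("beliefs have disjoint support; cannot normalize")
    return [p / norm for p in products]
```

Starting from a uniform accumulated belief, the first fusion simply returns the classifier output, and each further image sharpens or flattens the state depending on whether it agrees with the earlier evidence.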

Continuous VP Policy Network Combined with Dynamic Exploration
Similar to [16], the continuous VP policy is represented by a parameterized Gaussian distribution. However, ref. [16] only parameterizes the policy mean $\mu$ with a neural network, that is, $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \Sigma)$. (A viewpoint is composed of orientation and position, so the planning action $a$ may be a multi-dimensional vector and the Gaussian model may take a multivariate form. It is usually assumed that the variables in $a$ are independent of each other, so the covariance matrix $\Sigma$ is diagonal, i.e., $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$, where $\sigma$ is the standard deviation and $d$ is the dimension of $a$.) The standard deviations in the covariance matrix $\Sigma$ are small and fixed throughout the training process. As analyzed in Section 1, the standard deviation determines the viewpoint exploration ability, so a fixed small standard deviation may leave the VP policy stuck in a local optimum due to insufficient exploration. Therefore, an adaptive entropy regularization based dynamic exploration scheme is developed to automatically adjust the standard deviation in the training process in order to obtain a better policy. The development process and implementation details of the scheme are as follows.
Parameterization of the Policy Mean and Standard Deviation: The scheme is first realized by concurrently parameterizing the policy mean and standard deviations with two separate neural networks ($\mu_\theta(s)$, also written $\mu(s; \theta)$, and $\sigma_\theta(s)$, also written $\sigma(s; \theta)$) and training them at the same time. As shown in Figure 2, $\mu_\theta(s)$ and $\sigma_\theta(s)$ are two single-hidden-layer fully connected neural networks that take the state as input and output the mean vector and the standard deviation vector, respectively. Their parameters are collectively called $\theta$. Consequently, the VP policy is written as $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \Sigma_\theta(s))$, which expands to
$$\pi_\theta(a|s) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i(s; \theta)} \exp\left(-\frac{(a_i - \mu_i(s; \theta))^2}{2\sigma_i^2(s; \theta)}\right). \tag{5}$$
The $i$th elements of the action, the mean vector, and the standard deviation vector are represented as $a_i$, $\mu_i(s; \theta)$, and $\sigma_i(s; \theta)$, respectively, and $d$ is the dimension of the action $a$. During training, the update of the parameter $\theta$ under the PPO framework simultaneously affects the policy mean and standard deviations, leading to dynamic exploration.
Entropy Regularization: As stated in Section 1, in reinforcement learning it is generally desirable to have higher exploration in the early stage of policy learning and to reduce it gradually in the later stage in order to obtain a better policy [14]. However, we find that the standard deviations shrink prematurely and vary only within a small range during training. Figure 3 shows the change of the standard deviation during training on the GERMS dataset [12], which has a single action dimension (a shrinkage case with two action dimensions is shown in [14]). The standard deviation shrinks rapidly to a small value soon after the beginning of training and then stays within a small range (the curve with $c = 0$ in Figure 3), which may also result in insufficient exploration. To address this problem, we introduce entropy regularization [15] into the PPO optimization objective (3) to prevent the exploration from shrinking prematurely.
Therefore, (3) is transformed into:
$$\max_{\theta} L^{\mathrm{Ent}}(\theta) = E_{\pi_{\theta_{old}}}\left[\min\left(\kappa(\theta) A_{\pi_{\theta_{old}}}(s_t, a_t),\ \mathrm{clip}(\kappa(\theta), 1-\epsilon, 1+\epsilon) A_{\pi_{\theta_{old}}}(s_t, a_t)\right) + c H(\pi_\theta(\cdot|s_t))\right], \tag{6}$$
where $c$ is a constant coefficient and $H(\cdot)$ is the entropy operator ($H(x) = -\int p(x) \log p(x)\,dx$ or $H(x) = -\sum p(x) \log p(x)$; the entropy of a multivariate normal distribution is $\frac{1}{2} \log\left((2\pi e)^d |\Sigma|\right)$).
Adaptive Entropy Regularization Coefficient: In our experiment, we find that the constant coefficient $c$ in (6) is a hyper-parameter that is difficult to tune. As shown in Figure 3, when $c$ is less than or equal to 0.03, entropy regularization fails to prevent the premature decay of exploration; when $c$ is greater than 0.03, the standard deviation increases explosively. To tackle this problem, we finally propose an adaptive entropy regularization method that adapts the coefficient to achieve the appropriate exploration ability during training. The constant coefficient $c$ in (6) is replaced by the adaptive coefficient defined in (7), where $c_{div}$ is a divergence coefficient such as 0.04, 0.05, 0.1, 0.3, or 0.5 in Figure 3. If the planning action is multidimensional, then $c_{div}$ is a coefficient that makes the standard deviation of each dimension diverge. $\sigma_i^H(t)$ and $\sigma_i^L(t)$ are the upper and lower boundaries, in the $i$th dimension, of the standard deviation that we want to maintain during training. They are functions of the training time node $t$. In our work, we model them as the stage functions shown in Figure 4. To be specific, the stage functions in a certain dimension are defined in (8), where $T_W$ is the training duration of each stage, according to which the total training time is evenly divided into several stages. $\sigma_0$ is the initial standard deviation, $\sigma_\Delta$ is the increment of the standard deviation, and $\sigma_S$ is the boundary range. $[\cdot]$ is the rounding operator, e.g., $[1/3] = 0$. The term $\max(t - T_M, 0)$ extends the training time of the first stage by $T_M$.
As shown in Figure 3, this is because it takes some time to raise the standard deviation to the boundary value of the first stage at the beginning of training.
Figure 3. The change of the standard deviation in the training process on the GERMS dataset [12] under different dynamic exploration schemes. Because the standard deviation is a function of the state, the standard deviation representing the exploration ability refers to the average $\bar{\sigma}$ of the standard deviations corresponding to all states. However, there are infinitely many states, so $\bar{\sigma}$ cannot be calculated exactly. During training, we use the average standard deviation over some sample states to approximate $\bar{\sigma}$. We implement three dynamic exploration schemes step by step: (1) the first is the simultaneous parameterization of the policy mean and standard deviation with two separate neural networks (the curve with $c = 0$); (2) the second adds constant-coefficient entropy regularization on the basis of (1) (the curves with $c = 0.03$, 0.04, 0.05, 0.1, 0.3, or 0.5); (3) the third improves the constant coefficient into an adaptive version on the basis of (2) (the curve with Adaptive $c$). After experimental comparison, scheme (3) meets our dynamic exploration need.
Figure 4. The stage functions of the standard deviation boundaries over the training time node. After experimental verification, the dynamic exploration with adaptive entropy regularization can meet our exploration requirement.
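To make the link between the standard deviation and the entropy term concrete, the following Python sketch evaluates the log-density of a diagonal Gaussian policy, its closed-form entropy, and a sampled action. All function names are hypothetical, the mean and standard deviation vectors are assumed to be given (in the paper they come from the two policy networks), and the adaptive coefficient itself is not reproduced here.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def gaussian_entropy(std):
    """Entropy 0.5 * log((2*pi*e)^d * |Sigma|) for Sigma = diag(std_i^2)."""
    d = len(std)
    return 0.5 * d * math.log(2.0 * math.pi * math.e) + sum(math.log(s) for s in std)


def sample_action(mean, std, rng):
    """Draw one action; rng is a random.Random instance for reproducibility."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]
```

Because the entropy grows with $\log \sigma_i$, adding $cH$ to the objective rewards larger standard deviations, which is why a too-large constant $c$ can make the standard deviation diverge and a too-small one cannot stop it from collapsing.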

Reward Setting
The reward function $r(s_t, a_t)$ plays an important role in encouraging effective viewpoint selection. In Section 4.1, the recognition state ($s_t = P(o|I_{\varphi_0}, I_{\varphi_1}, \ldots, I_{\varphi_t})$, $o = 1, 2, \ldots, M$) describes a probability distribution over different objects. The flatter the distribution is, the stronger the recognition ambiguity is. Here, we resort to information entropy [26,27] to quantify the ambiguity, so the ambiguity in state $s_t$ is represented as $H(s_t)$. The goal of AOR is to eliminate this ambiguity through viewpoint planning and thereby improve recognition performance. A beneficial viewpoint attempt reduces the current ambiguity. Therefore, we design the reward function according to the ambiguity in different states after viewpoint selection. Let $\hat{o}_{t+1}$ be the predicted result and $o^*$ be the label of the image at the new viewpoint ($I_{\varphi_{t+1}} = I_{\varphi_t + a_t}$), where $\hat{o}_{t+1} = \arg\max_{o} P(o|I_{\varphi_0}, I_{\varphi_1}, \ldots, I_{\varphi_{t+1}})$. If the predicted result $\hat{o}_{t+1}$ is right and the information entropy $H(s_{t+1})$ is smaller than $H(s_t)$, the VP action $a_t$ in state $s_t$ is useful for recognition, and the agent receives a positive reward. Otherwise, when the entropy does not decrease or the prediction is wrong, the reward is non-positive. To sum up, the reward function is formulated as
$$r(s_t, a_t) = \begin{cases} r^{-}, & \hat{o}_{t+1} \neq o^*, \\ 0, & \hat{o}_{t+1} = o^*,\ H(s_{t+1}) \geq H(s_t), \\ 1, & \hat{o}_{t+1} = o^*,\ H(s_{t+1}) < H(s_t), \end{cases} \tag{9}$$
where $r^{-} \leq 0$ denotes the non-positive reward for a wrong prediction and $r(s_t, a_t)$ can be denoted as $r_t$ for simplicity.
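The reward rule can be sketched in Python as follows. The sketch assumes beliefs are lists of probabilities and labels are zero-based indices; the function names are hypothetical, and the reward for a wrong prediction is deliberately left as a required argument (`wrong_reward`), since the text only states that it is non-positive.

```python
import math


def belief_entropy(belief):
    """Shannon entropy of a belief vector; zero-probability entries contribute 0."""
    return -sum(p * math.log(p) for p in belief if p > 0.0)


def reward(prev_belief, new_belief, true_label, wrong_reward):
    """Reward for the action that led from prev_belief to new_belief.

    wrong_reward -- caller-supplied non-positive value for a wrong prediction
                    (an assumption of this sketch, not a value from the paper)
    """
    if wrong_reward > 0.0:
        raise ValueError("wrong_reward must be non-positive")
    predicted = max(range(len(new_belief)), key=new_belief.__getitem__)
    if predicted != true_label:
        return wrong_reward
    if belief_entropy(new_belief) < belief_entropy(prev_belief):
        return 1.0  # correct prediction and reduced ambiguity
    return 0.0      # correct prediction but ambiguity did not decrease
```

For example, moving from a flat belief `[0.5, 0.5]` to `[0.9, 0.1]` with true label 0 is rewarded, whereas moving from `[0.9, 0.1]` back to `[0.6, 0.4]` keeps the prediction correct but earns nothing.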

Training the Policy Network
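To solve the optimization problem in (6), we develop a training algorithm to iteratively update $\theta$ in the policy network. The algorithm, shown in Algorithm 1, follows the Actor-Critic style [15].
To replace the expectation operator in (6), we apply the Monte Carlo method [28] to approximate it. Specifically, we run the old policy $\pi_{\theta_{old}}$ for $T$ time steps to collect a trajectory $\{s_t, a_t, r_t, s_{t+1}\}_{t=0}^{T}$ and repeat this $N$ times. With $N$ trajectories, (6) can be approximated as:
$$\max_{\theta} \hat{L}^{\mathrm{Ent}}(\theta) = \frac{1}{N(T+1)} \sum_{i=1}^{N} \sum_{t=0}^{T} \left[\min\left(\kappa^{(i)}(\theta) A^{(i)}_{\pi_{\theta_{old}}}(s_t, a_t),\ \mathrm{clip}(\kappa^{(i)}(\theta), 1-\epsilon, 1+\epsilon) A^{(i)}_{\pi_{\theta_{old}}}(s_t, a_t)\right) + c H(\pi^{(i)}_{\theta}(\cdot|s_t))\right]. \tag{10}$$
The advantage function $A_{\pi_{\theta_{old}}}(s_t, a_t)$ can be estimated using generalized advantage estimation (GAE) [29]:
$$A_{\pi_{\theta_{old}}}(s_t, a_t) = \sum_{l \geq 0} (\gamma\lambda)^{l} \delta_{t+l}, \tag{11}$$
where $\delta_t = r_t + \gamma V_{\pi_{\theta_{old}}}(s_{t+1}) - V_{\pi_{\theta_{old}}}(s_t)$, $\lambda$ is the GAE parameter, and the sum is truncated at the end of the trajectory.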
$V_{\pi_{\theta_{old}}}(\cdot)$ is the state value function under the old VP policy $\pi_{\theta_{old}}$. It is approximated by a two-layer fully connected network with parameter $\omega$, which maps the state $s_t$ to the function value $V(s_t; \omega)$. We update $\omega$ to obtain the state value function corresponding to different VP policies. We use the $N$ trajectories (sampled by $\pi_{\theta_{old}}$) again to fit the state value function $V_{\pi_{\theta_{old}}}(s_t; \omega)$ of the old policy $\pi_{\theta_{old}}$ by solving the optimization problem:
$$\min_{\omega} \frac{1}{N(T+1)} \sum_{i=1}^{N} \sum_{t=0}^{T} \left(V(s_t^{(i)}; \omega) - V_{Target}(s_t^{(i)})\right)^2, \tag{12}$$
where $V_{Target}(s_t)$ is not involved in the optimization procedure. It is calculated in advance using $V_{Target}(s_t) = r_t + \gamma r_{t+1} + \ldots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V_{\pi_{\theta_{old}}}(s_T; \omega)$. The iterative update process of (10) and (12) is shown in lines 7-10 of Algorithm 1. Once the optimal parameter $\theta^*$ is obtained, it can be used for the practical AOR task. In state $s_t$, the planned action is $a_t \sim \mathcal{N}(\mu_{\theta^*}(s_t), \Sigma_{\theta^*}(s_t))$, and the next best viewpoint is $\varphi_{t+1} = \varphi_t + a_t$.
Algorithm 1: Training the continuous VP policy network
Input: Parameters $L$, $N$, $T$, $K_A$, $K_C$.
Output: Parameter $\theta$.
1 Create a new policy network, an old policy network, and a state value network with parameters $\theta$, $\theta_{old}$, and $\omega$, respectively. The new and old policy networks have the same structure. Initialize the parameters $\theta$, $\theta_{old}$, and $\omega$ randomly.
2 for episode ← 1 to $L$ do
3     $\theta_{old} \leftarrow \theta$
4     for $i$ ← 1 to $N$ do
5         Run policy $\pi_{\theta_{old}}(a|s)$ for $T$ time steps, collecting a trajectory $\{s_t, a_t, r_t, s_{t+1}\}_{t=0}^{T}$, where $a_t \sim \pi_{\theta_{old}}(a_t|s_t)$ and $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$.
6     For each $t$ in every trajectory, estimate the advantage function $A_{\pi_{\theta_{old}}}(s_t, a_t)$ according to (11).
7     for $i$ ← 1 to $K_A$ do
8         Optimize $\hat{L}^{\mathrm{Ent}}(\theta)$ in (10) w.r.t. $\theta$ with all $N(T+1)$ samples or with minibatches of size $M \leq N(T+1)$.
9     for $i$ ← 1 to $K_C$ do
10        Optimize (12) w.r.t. $\omega$ with the same samples.
11    Adapt the entropy regularization coefficient $c$ in (10) according to (7) with the new policy network $\pi_\theta$ and the $N(T+1)$ samples.
12 return $\theta$
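The two per-trajectory quantities used during training, the GAE advantage and the bootstrapped value target, can be computed with simple backward recursions. The sketch below assumes a single trajectory given as Python lists; the function names and the convention of passing one extra bootstrap value are illustrative assumptions, while the defaults for `gamma` and `lam` are the values reported in the experimental setup.

```python
def gae_advantages(rewards, values, gamma=0.96, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards -- r_0 ... r_{n-1}
    values  -- V(s_0) ... V(s_n); the last entry bootstraps the final step
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must contain exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def value_targets(rewards, bootstrap_value, gamma=0.96):
    """Discounted return targets that bootstrap from the value of the last state."""
    targets = [0.0] * len(rewards)
    running = bootstrap_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets
```

Both functions run in time linear in the trajectory length, and the targets are computed once per batch of trajectories so that they stay fixed while the value network is fitted.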

Experimental Setup
Dataset and Metric: The GERMS dataset [12], shown in Figure 5, was collected in the context of the RUBI project, whose intention is to develop a robot that interacts with toddlers in early childhood education. It is composed of 1365 video tracks of give-and-take trials using 136 different soft toy objects. The tracks are divided according to the arm of the robot, with roughly half of the training and testing tracks recorded with the left arm and the other half with the right arm. Each trial generates a track that records the robot putting the grasped object in its center of view, rotating it by 180° and then returning it. During the trial, the robot continuously saves images from its head-mounted camera at 30 frames per second, as shown in Figure 6. Meanwhile, the joint position and object label are recorded. These data are stored in a track, and a series of tracks constitutes the dataset. On average, each track contains 150 images; Table 1 outlines the number of images in the dataset. The joint positions in each track allow researchers to simulate different VP methods in a one-dimensional action space. The performance of different VP methods is evaluated using recognition accuracy averaged over the entire test set.
Figure 5. The GERMS dataset [12]. The objects are soft toys depicting various human cell types, microbes and disease-related organisms.
Implementation Details: In this work, we employ the Tensorflow platform [30] to implement the proposed method. The CNN model used in the pre-trained classifier is the VGGnet provided in [12], which transforms each image in GERMS into a 4096-dimensional feature vector. The number of neurons in the last softmax layer of the pre-trained classifier is 136. In the policy network $\mu_\theta(s)$, the hidden layer has 1024 neurons with the relu activation function; the last layer has one neuron with the tanh activation function.
In order to match the viewpoint range of GERMS, we multiply the output of tanh by 512, so that the range of the next relative VP action is $[-45^\circ, 45^\circ]$. In the policy network $\sigma_\theta(s)$, the configuration of the hidden layer is consistent with that in $\mu_\theta(s)$; the last layer has one neuron with the softplus activation function. The configuration of the hidden layer in the state value network $V_\omega(s)$ is the same as that in $\mu_\theta(s)$. The reward discount factor $\gamma$ is 0.96, and the GAE parameter $\lambda$ is 0.95. The clipping ratio parameter is empirically set to $\epsilon = 0.2$, following the original implementation of PPO [13]. The VP policy converges after 4200 episodes in the training process; therefore, we set $L = 4200$. $N$ and the minibatch size $M$ are both 128. $K_A$ and $K_C$ are 1 and 10, respectively. The maximum number of recognition steps is set to $T = 12$. The Adam optimizer [31] is used to optimize the policy network and the state value network, with learning rates of 0.0001 and 0.0002, respectively. In the dynamic exploration, the parameters $c_{div}$, $\sigma_0$, $T_M$, $T_W$, $\sigma_\Delta$ and $\sigma_S$ are 0.3, 106, 3, 3, 14, and 14, respectively.
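The output layers described above can be illustrated with a minimal Python sketch. It only covers the final activations for the one-dimensional GERMS action: the pre-activation inputs `mean_logit` and `std_logit` are hypothetical stand-ins for the outputs of the two 1024-unit hidden layers, and the function names are not taken from the authors' code.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def policy_heads(mean_logit, std_logit, action_scale=512.0):
    """Map the two last-layer pre-activations to a mean and a standard deviation.

    The mean is tanh-squashed and scaled by 512 (the factor stated in the
    text); the standard deviation is kept non-negative by softplus.
    """
    mean = action_scale * math.tanh(mean_logit)
    std = softplus(std_logit)
    return mean, std
```

The tanh head bounds the planned relative movement regardless of the hidden-layer output, while softplus lets the standard deviation grow without an upper limit, which is what allows the entropy term to widen exploration.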

Ablation Study
To investigate the effectiveness of our dynamic exploration scheme, we conduct variant experiments in which different components are ablated. Table 2 shows the abbreviations and interpretations of the different components. In the variant experiments, the components AERC, ER, and SSDN are removed one after another. The experimental results are presented in Figure 7, where the recognition accuracy is a function of the number of planned actions. From Figure 7, we can see that the performance degrades heavily after removing the component AERC. The results of the experiments BL ($\sigma = 0.1$), BL+SSDN, and BL+SSDN+ER ($c = 0.03$) are similar, because their exploration ability is at a low level in all three cases. Although the experiment BL ($\sigma = 100$) has a high exploration ability, the VP policy cannot converge to the optimum, so its result is slightly worse. The result of the experiment BL+SSDN+ER ($c = 0.3$) is the most unsatisfactory, because its standard deviation increases explosively, as shown in Figure 3. This study validates the effectiveness of our proposed adaptive entropy regularization based dynamic exploration scheme.

Dynamic Exploration Study
In our dynamic exploration scheme, the standard deviation $\sigma$ is adapted by updating the VP policy parameters $\theta$ during training. Another natural idea (i.e., independent linear decaying dynamic exploration, ILDDE) is to adjust $\sigma$ independently of the parameters $\theta$. The idea is realized as
$$\sigma(t) = \sigma_0 - (\sigma_0 - \sigma_L)\frac{t}{T_L}, \tag{13}$$
where $\sigma$ is a linearly decaying function of the training time node $t$, $\sigma_0$ and $\sigma_L$ are the initial and final $\sigma$ values, respectively, and $T_L$ is the total training time. Therefore, the VP policy can be represented as $\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \Sigma(t))$, where $\Sigma(t) = \mathrm{diag}(\sigma_1^2(t), \sigma_2^2(t), \ldots, \sigma_d^2(t))$. During training, the update of the parameters $\theta$ only affects the policy mean, while the policy standard deviation is adapted independently by (13). We experiment with this idea and compare it with our scheme. In the experiment, the independent network $\sigma_\theta(s)$ in Figure 2 is removed and replaced with $\sigma(t)$ in (13); everything else is exactly the same. From the results presented in Figure 8, we can see that the performance of our scheme is much better than that of ILDDE. This is because the VP policy corresponding to ILDDE is affected by two quantities, $\theta$ and $t$, but $t$ does not participate in the optimization process, which may make the learned policy progressively worse. In our scheme, by contrast, the policy mean and standard deviation depend only on $\theta$ and participate in the whole optimization process.
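The ILDDE schedule is simple enough to state directly in Python. The sketch below is illustrative only: the function name is hypothetical, and clamping $t$ to $[0, T_L]$ is an added assumption so that the schedule stays at the final value if training runs past the planned duration.

```python
def linear_decay_std(t, sigma_0, sigma_l, total_time):
    """Linearly interpolate the standard deviation from sigma_0 to sigma_l.

    t          -- current training time node
    sigma_0    -- initial standard deviation
    sigma_l    -- final standard deviation
    total_time -- total training time T_L (must be positive)
    """
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    frac = min(max(t / total_time, 0.0), 1.0)  # clamp (assumption of this sketch)
    return sigma_0 + (sigma_l - sigma_0) * frac
```

Note that this schedule depends only on the training clock and never receives a gradient, which is the property the comparison above identifies as the weakness of ILDDE.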

Comparison with the State-of-the-Art Methods
In this subsection, several baselines [10] and state-of-the-art VP approaches [11,12,16] are compared experimentally with our continuous VP method. They are listed as follows:
• Random policy [10] plans a random action from the action space $\{\pm\frac{\pi}{64}, \pm\frac{\pi}{32}, \pm\frac{\pi}{16}, \pm\frac{\pi}{8}, \pm\frac{\pi}{4}\}$ with uniform probability;
• Sequential policy [10] moves the agent to the next immediate position in the same direction;
• DQL policy [11,12] exploits the deep Q-Learning algorithm to learn a discrete VP policy. The discrete action space is $\{\pm\frac{\pi}{64}, \pm\frac{\pi}{32}, \pm\frac{\pi}{16}, \pm\frac{\pi}{8}, \pm\frac{\pi}{4}\}$;
• E-TRPO policy [16] develops a continuous VP method that is implemented by trust region policy optimization [17] and an extreme learning machine [18]. It represents the VP policy as a Gaussian model and learns the policy with a fixed exploration scheme.
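As a small illustration of the quantization error that motivates continuous planning, the following Python sketch snaps a desired rotation to the nearest action in the discrete action space used by the Random and DQL baselines and reports the residual. The helper names are hypothetical and the example is not part of the paper's evaluation.

```python
import math

# Discrete action space used by the Random and DQL baselines.
DISCRETE_ACTIONS = [
    sign * math.pi / denom for denom in (64, 32, 16, 8, 4) for sign in (-1, 1)
]


def nearest_discrete_action(target):
    """Return the discrete action closest to the desired continuous rotation."""
    return min(DISCRETE_ACTIONS, key=lambda a: abs(a - target))


def quantization_error(target):
    """Absolute gap between the desired rotation and the best discrete action."""
    return abs(nearest_discrete_action(target) - target)
```

A desired rotation of $3\pi/16$, halfway between $\pi/8$ and $\pi/4$, is off by $\pi/16$ whichever neighbor is chosen, whereas a continuous policy can output that rotation directly.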
For a fair comparison, all methods use the same classifier in the experiment. The evaluation results of our VP model against the other approaches are presented in Figure 9, from which we make the following observations: (1) Our proposed method achieves better performance than the state-of-the-art methods; (2) Active policies perform significantly better than passive policies. Random policy and Sequential policy are essentially passive VP policies: they do not actively plan the next viewpoint according to the information obtained from the previous viewpoints. In contrast, DQL policy, E-TRPO policy, and the proposed method use the previous information to plan the next viewpoint, so they are active VP policies; (3) Continuous VP policies outperform the discrete VP policy. DQL policy is a discrete VP policy, while E-TRPO policy and our method are continuous VP policies. A continuous VP policy explores the continuous viewpoint space and therefore does not skip important viewpoints that fall between sampled positions; (4) Compared with the continuous VP method E-TRPO, our continuous VP model has better performance. This is mainly because we present an effective dynamic exploration scheme, which can explore more new viewpoints and find better solutions.

Figure 9. Performance comparison between our proposed continuous VP method and several competing approaches (Random, Sequential, DQL, E-TRPO, and Ours).

Conclusions
In this work, we develop a continuous viewpoint planning method for active object recognition based on reinforcement learning. More specifically, the viewpoint planning policy is represented as a parameterized Gaussian model and learned using the proximal policy optimization framework. We also design a dynamic exploration scheme based on adaptive entropy regularization to automatically adjust the viewpoint exploration ability in the learning process. Experiments on the public dataset GERMS show the superiority of our method.