Visual Pretraining via Contrastive Predictive Model for Pixel-Based Reinforcement Learning

In an attempt to overcome the limitations of reward-driven representation learning in vision-based reinforcement learning (RL), an unsupervised learning framework, referred to as visual pretraining via contrastive predictive model (VPCPM), is proposed to learn representations detached from policy learning. Our method enables the convolutional encoder to perceive the underlying dynamics through a pair of forward and inverse models under the supervision of a contrastive loss, resulting in better representations. In experiments on a diverse set of vision-based control tasks, initializing the encoders with VPCPM significantly boosts the performance of state-of-the-art vision-based RL algorithms, with improvements of 44% for RAD and 10% for DrQ at 100k environment steps. In comparison to prior unsupervised methods, VPCPM matches or outperforms all baselines. We further demonstrate that the learned representations generalize successfully to new tasks that share a similar observation and action space.


Introduction
Recent advances in deep reinforcement learning (RL) have allowed agents to perform complex control tasks directly from raw sensory observations. Remarkable successes have been achieved, ranging from learning to play video games from raw pixels [1], solving complex tasks from first-person-view observations [2,3], to autonomously performing robotic tasks [4][5][6][7].
As a standard practice, deep RL methods jointly learn a visual encoder and a policy in an end-to-end manner. In this paradigm, the visual representations are learned under the strong supervision of task-specific rewards. While the simplicity of end-to-end methods is appealing, learning representations that rely on rewards has several limitations. First, representations are harder to acquire under sparse rewards, thus requiring more data for convergence. Additionally, in practice, the reward function is commonly redesigned and retested until a suitable function is selected, which forces the representation learning process to be repeated and is therefore inefficient. Furthermore, whenever the agent encounters a new task, representation learning is performed again even if the environment appearance is identical across tasks.
In this work, we pursue an alternative paradigm, as depicted in Figure 1, wherein the visual encoder is first pretrained without any reward supervision, detached from policy learning, and the learned task-agnostic representations are then transferred for learning a policy on a specific task under reward supervision. This two-stage training enables representations learned in an environment to be reused for other tasks that use the same environment; for example, the cheetah environment contains both forward and backward locomotion tasks. Furthermore, the policy learning stage requires fewer samples for training and is thus more sample efficient.

Figure 1. The end-to-end training paradigm (left) and our two-stage training (right). In end-to-end training, the visual encoder is jointly trained with the policy under the supervision of rewards from the environment. In contrast, our method detaches representation learning from policy learning. In the first stage, the visual encoder is trained with the proposed contrastive predictive model without rewards and is then frozen. In the second stage, given some tasks, the policy is trained by reusing the frozen encoder.
In the first stage, a natural choice is to pretrain the visual encoder on a large dataset such as ImageNet. However, previous works show that naive pretraining on such a dataset does not have a significant impact [8][9][10]. This ineffectiveness might stem from domain discrepancy and, more importantly, from the pretraining data failing to reflect the natural dynamics relations, i.e., the Markov decision process (MDP) property, inherent in RL tasks. As we show later in our experiments, pretraining the visual encoder even with in-domain data still underperforms if dynamics are not considered. Early works [11][12][13] commonly learn a compact representation by reconstructing the pixels of the current or subsequent frame, which is very challenging with high-dimensional observations. Recently, ref. [14] introduced contrastive prediction to bypass reconstruction-based prediction. However, this method does not capture the environment dynamics, resulting in low performance on complex tasks such as Cheetah or Walker. To mitigate this shortcoming, ref. [15] introduced augmented temporal contrast (ATC), which additionally trains a marginal forward dynamics model as an auxiliary task for representation learning. Despite its effectiveness, the marginal forward model is not conditioned on actions, so ATC focuses on temporal coherence between observations rather than on modeling environment dynamics. To address these issues, this paper introduces a method referred to as visual pretraining via contrastive predictive model (VPCPM). Specifically, VPCPM utilizes reward-free data from the environment to learn the visual encoder by jointly optimizing forward and inverse dynamics models through a contrastive objective. The visual encoder first maps the observation into a latent state. Given a latent state and a latent action, the forward model predicts the next latent state.
Meanwhile, the inverse model infers the executed action given two consecutive latent states. Learning the forward and inverse models serves as a constraint that enables the visual encoder to consistently follow the underlying dynamics of the environment.
The proposed method is evaluated on a diverse set of image-based continuous control tasks from the DeepMind Control Suite [16]. The experimental results show that VPCPM consistently improves the performance and sample efficiency of state-of-the-art vision-based RL algorithms. Specifically, at 100k environment steps, VPCPM improves the mean returns by 44% over reinforcement learning with augmented data (RAD) [17] and by 10% over data-regularized Q (DrQ) [18]. Compared to prior unsupervised pretraining methods, VPCPM matches or outperforms all baselines across all tested environments. Moreover, an investigation of unseen tasks shows that the VPCPM-initialized encoder successfully generalizes to unseen tasks operating in the same environment.
The rest of the paper is organized as follows: Section 2 discusses related works, Section 3 describes the setup for vision-based RL and the base algorithm, Section 4 details our proposed method, Section 5 presents extensive experiments, and Section 6 concludes the paper.

Prior auxiliary-task approaches optimize their self-supervised losses and the RL objective simultaneously, which can be considered learning from scratch in an end-to-end manner. By contrast, our framework trains in two stages, and we further propose a new self-supervised auxiliary objective in the pretraining stage to provide a meaningful representation for direct use or further fine-tuning in the testing environment.
Unsupervised Pretraining Representation for RL. Pretraining representations without supervision using an unsupervised or self-supervised framework is common practice in other fields, such as natural language processing (NLP) [55][56][57] and computer vision [42–44,50,58,59]. These studies consider effective ways to learn a visual encoder from massive unlabeled data and to reduce sample complexity when learning a new task. In deep RL, early studies attempted to pretrain the visual encoder using a pixel-reconstruction task [11,12] or an object detection task [60]; the pretrained representations are then fine-tuned on a specific task. Some recent studies [8][9][10] showed that naive pretraining on ImageNet is not helpful for downstream RL tasks. This hints that the visual encoder of an RL agent should be pretrained on data closely related to its environments. Along this line of research, ref. [14] attempted to use their proposed objective, originally designed for online training, to learn the visual encoder detached from policy learning. However, because the environment dynamics were not encoded, performance was limited on complex tasks such as Cheetah. The study closest to ours is [15], which also attempted to learn the encoder detached from the RL objective. However, that method implicitly estimates the next observation by marginalizing over actions. In contrast, our method relies on forward and inverse dynamics models with a contrastive loss, where the forward model is conditioned on the action, such that in the latent space the representations of states and actions consistently follow the underlying Markov decision process of the environment.
In addition to pretraining the visual encoder, another line of research tries to pretrain the policy with self-supervised intrinsic rewards [38,61–65]. In this setting, the agent is first allowed to interact freely with the environment for a long period without access to extrinsic rewards, and is then exposed to task-specific rewards to learn downstream tasks. The intrinsic reward is commonly formulated to encourage the RL agent to gain new knowledge about the environment [38,61,62], to maximize the diversity of collected data [9,66], or to learn diverse skills [67–70]. For vision-based RL tasks in this setting, the visual encoder is trained concurrently with the policy during pretraining. Recently, ref. [9] proposed using a particle-based estimator [71] to estimate the entropy of observations, with representations learned by a contrastive loss from SimCLR [44]. Alternatively, ref. [66] proposed a self-supervised pretraining scheme that detaches representation learning from exploration (i.e., learning from intrinsic rewards) to enable the generalization of representations to unseen tasks. In this method, the representations are learned by SwAV [72], a variant of clustering-based contrastive loss. These works are promising for acquiring general policies as well as generalized representations. However, they still require free interaction with the environment during pretraining, which is potentially unsafe in the real world. In contrast, our framework allows the encoder to be learned entirely from an offline dataset, which is safer and enables the reuse of past data.

Background
In this section, the framework for vision-based reinforcement learning is presented together with a representative off-policy model-free algorithm, soft actor-critic.

Reinforcement Learning from Images
The problem of solving a control task from high-dimensional observations is formulated as a partially observable Markov decision process (POMDP) [73,74], defined as a tuple (O, A, p, r, γ). Here, O is the high-dimensional observation space; A is the action space; the transition dynamics p = Pr(o_{t+1} | o_{≤t}, a_t) represent the probability distribution over the next observation o_{t+1} given the history of previous observations o_{≤t} and the current action a_t; the reward function r : O × A → R maps the current observation and action to a reward r_t = r(o_t, a_t); and γ ∈ [0, 1) is a discount factor. Following common practice [1], the POMDP is reformulated as an MDP [73] by stacking consecutive observations into a state s_t = {o_t, o_{t−1}, o_{t−2}, . . .}. For simplicity of notation, the transition dynamics and reward function are redefined as p = Pr(s_{t+1} | s_t, a_t) and r_t = r(s_t, a_t), respectively. The goal of RL is to find a policy π(a_t | s_t) that maximizes the expected return, defined as the total accumulated reward E_π[ Σ_{t=0}^{T} γ^t r_t | a_t ∼ π(· | s_t), s_{t+1} ∼ p(· | s_t, a_t), s_0 ∼ p_0(·) ], where T is the episode length and p_0 is the initial state distribution.
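As a small illustration of the frame-stacking construction above, the following sketch (our own, not from the paper's codebase) builds the state s_t from the k most recent observations, padding with the first frame at episode start:

```python
from collections import deque
import numpy as np

def stack_frames(buffer: deque, new_frame: np.ndarray, k: int = 3) -> np.ndarray:
    """Append the newest observation and return the stacked state s_t.

    At episode start the buffer is padded by repeating the first frame,
    so the returned state always has shape (k, H, W, C).
    """
    buffer.append(new_frame)
    while len(buffer) < k:          # pad at episode start
        buffer.append(new_frame)
    return np.stack(list(buffer)[-k:], axis=0)

# Example: three consecutive 84 x 84 RGB renderings form one state.
frames = deque(maxlen=3)
state = stack_frames(frames, np.zeros((84, 84, 3), dtype=np.uint8))
```

With `maxlen=k`, the deque automatically discards the oldest frame, so repeated calls slide the window over the observation history.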

Soft Actor-Critic
Soft actor-critic (SAC) [75] is an off-policy actor-critic method based on the maximum entropy RL framework [76], which encourages exploration and robustness to noise by maximizing a weighted objective of the reward and the policy entropy. To update its parameters, SAC performs soft policy evaluation and soft policy improvement steps. The soft policy evaluation step fits a parametric Q-function Q(s_t, a_t) using transitions from the replay buffer D by minimizing the soft Bellman residual:

L_Q = E_{(s_t, a_t, s_{t+1}) ∼ D} [ (Q(s_t, a_t) − (r_t + γ V̄(s_{t+1})))^2 ].    (1)

The target value function V̄ is approximated via a Monte Carlo estimate of the following expectation:

V̄(s_{t+1}) = E_{a' ∼ π} [ Q̄(s_{t+1}, a') − α log π(a' | s_{t+1}) ],    (2)

where Q̄ is the target Q-function, whose parameters are obtained from an exponentially moving average of the Q-function parameters to stabilize training. The soft policy improvement step then updates the stochastic policy π by minimizing the following objective:

L_π = E_{s_t ∼ D, a_t ∼ π} [ α log π(a_t | s_t) − Q(s_t, a_t) ].    (3)

In this work, a learnable temperature α is used instead of a fixed value, optimized with the following objective:

L_α = E_{a_t ∼ π} [ −α log π(a_t | s_t) − α H̄ ],    (4)

where H̄ ∈ R is the target entropy hyperparameter that the policy attempts to match, which in practice is usually set to −|A|. SAC is one of the state-of-the-art RL algorithms for continuous control [75]. It is also widely used as a backbone for solving vision-based control tasks [5,14,17,18,36,46]. In this work, we adopt RAD [17] and DrQ [18], which are built on top of SAC, for policy learning.
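To make the critic update concrete, here is a minimal numerical sketch of the soft Bellman target implied by Equations (1) and (2). It uses the clipped double-Q variant common in SAC implementations; the function name and the default temperature are our own illustration, not the paper's code:

```python
import numpy as np

def soft_bellman_target(r, gamma, q1_targ, q2_targ, logp_next, alpha=0.1):
    """y = r + gamma * (min(Q1_targ, Q2_targ) - alpha * log pi(a'|s')).

    q1_targ / q2_targ are target-network Q-values at (s', a' ~ pi),
    logp_next is log pi(a'|s'); taking the minimum is the standard
    clipped double-Q trick to reduce overestimation.
    """
    v_next = np.minimum(q1_targ, q2_targ) - alpha * logp_next
    return r + gamma * v_next

# One transition: reward 1.0, target Qs 2.0 and 3.0, log-prob -1.0.
y = soft_bellman_target(1.0, 0.99, 2.0, 3.0, -1.0)
```

The entropy bonus (−α log π) raises the target when the policy is uncertain, which is exactly the maximum-entropy incentive described above.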

Method
In this section, the proposed visual pretraining via contrastive predictive model (VPCPM) is described. The proposed method can be used to pretrain the visual encoder, which is then utilized for policy learning by common model-free vision-based RL algorithms.

Network Architecture
The control policy network takes the state s_t as input and outputs the action a_t. It consists of the visual encoder π_e, parameterized by φ, and the policy π_a, parameterized by θ_a, as depicted in Figure 1. This design enables the encoder to be trained independently of the policy part, i.e., without requiring the RL objective for training. The goal of the proposed method is to learn useful representations from a given amount of reward-free data, such that π_a can be efficiently trained on top of them to solve RL tasks.

Visual Pretraining via Contrastive Predictive Model
VPCPM introduces a useful prior for the vision-based RL training procedure by enforcing representations that not only capture semantic information but also conform to the dynamics of the environment. During the pretraining stage, for a given environment, it is assumed that there is a pre-collected dataset D consisting of N transitions (s_t, a_t, s_{t+1}) without task-specific rewards. The visual encoder should effectively encode the semantics and consistently follow the dynamics using only the primitive elements, i.e., observations and actions. An overview of the proposed method is shown in Figure 2. The visual encoder π_e : O → Z learns the mapping from observations into the latent space. VPCPM alternates between learning the forward dynamics model (forward step) and the inverse dynamics model (inverse step) while optimizing the underlying encoder π_e (Algorithm 1). At the forward step, the forward model F, parameterized by ψ, takes the current latent state and the latent action as inputs to predict the next latent state. To optimize F together with π_e, the InfoNCE loss [42] is employed, which contrasts the predicted next latent state with the ground truth. Formally, let f : Z × Z → R be a similarity metric; the objective of the forward model is:

L_F = −E[ log ( exp(f(ẑ_{t+1}, z̄_{t+1})) / Σ_{j=1}^{K} exp(f(ẑ_j, z̄_{t+1})) ) ],    (5)

where ẑ_{t+1} = F(z_t, ā_t) is the predicted next latent state and z̄_{t+1} is the target latent state, computed without an encoder parameter update. The expectation is computed over K samples of (s_t, a_t, s_{t+1}). Operating in the latent space bypasses forward-model prediction in pixel space, which would be extremely challenging given the large uncertainty in pixel prediction. The use of InfoNCE helps to learn a discriminative representation, in which dissimilar states are repelled and similar states are pulled close. Additionally, this objective prevents trivial collapsed solutions in which constant features are produced for every state.

Figure 2. VPCPM for the encoder π_e: At the forward step, the forward model F takes the latent state z_t and the latent action ā_t as inputs to predict the next latent state ẑ_{t+1}. In this step, both F and π_e are optimized together. At the inverse step, the inverse model I takes the two latent states z_t and z_{t+1} as inputs to predict the action a_t. In this step, both I and π_e are optimized together.
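A minimal numpy sketch of an InfoNCE loss of the kind used in the forward step, assuming cosine similarity and a temperature of 0.1 (both illustrative choices on our part, not confirmed details of VPCPM):

```python
import numpy as np

def info_nce_forward_loss(z_pred, z_targ, temperature=0.1):
    """InfoNCE over a batch of K latents: row i of z_pred (the predicted
    next latent) should match row i of z_targ (the stop-gradient target),
    while the other rows of the batch serve as negatives."""
    zp = z_pred / np.linalg.norm(z_pred, axis=1, keepdims=True)
    zt = z_targ / np.linalg.norm(z_targ, axis=1, keepdims=True)
    logits = zp @ zt.T / temperature              # (K, K) similarity scores
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))  # positives on the diagonal

# Perfectly aligned predictions give a near-zero loss; misaligned ones do not.
z = np.eye(4)
low = info_nce_forward_loss(z, z)
high = info_nce_forward_loss(z, np.roll(z, 1, axis=0))
```

Because the loss is a cross-entropy over the batch, a collapsed encoder that maps every state to the same vector makes all logits equal and keeps the loss at log K, which is what prevents the trivial solution mentioned above.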

Algorithm 1 Visual pretraining via contrastive predictive model (VPCPM)
1: Input: Reward-free dataset D, the encoder's parameters φ, batch size K
2: Output: The encoder's parameters φ
3: Initialize: The parameters φ, ψ, ρ
4: for k = 1 to ∞ do
5:    Sample a minibatch of K transitions (s_t, a_t, s_{t+1}) from D
6:    Compute the latent states: z_t = π_e(s_t), z_{t+1} = π_e(s_{t+1})
7:    Train forward model: Equation (5)
8:    Train inverse model: Equation (6)
9: end for

In the inverse step, the inverse model I, parameterized by ρ, takes the states before and after a transition and predicts the action in between. In this work, the inverse model operates in the latent space extracted by the visual encoder π_e. The encoder is jointly optimized with the inverse model by minimizing the following objective:

L_I = E[ ℓ( I(z_t, z_{t+1}), a_t ) ],    (6)

where ℓ measures the discrepancy between the predicted and executed actions. For continuous actions, ℓ can be defined as the mean squared error or mean absolute error between the ground-truth and predicted actions. When predicting the action, the inverse model attends to the controllable features and the temporal difference between consecutive states in the latent space, which also encourages the encoder to capture discriminative features. Pretraining representations that satisfy both the forward and inverse dynamics models strengthens the relations of states and actions in the latent space, establishing an initialization point for further fine-tuning inside a region of parameter space to which the parameters are henceforth restricted.
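Schematically, Algorithm 1 just alternates the two updates over minibatches. The driver below shows that structure, with callables standing in for the encoder and the two optimizer steps (these placeholders are our own, not the authors' implementation):

```python
def vpcpm_pretrain(batches, encode, forward_step, inverse_step):
    """Alternate contrastive forward and inverse updates, as in Algorithm 1.

    encode maps raw observations to latents; forward_step optimizes
    F and the encoder with the contrastive loss of Equation (5);
    inverse_step optimizes I and the encoder with the action-prediction
    loss of Equation (6).
    """
    for s_t, a_t, s_next in batches:
        z_t, z_next = encode(s_t), encode(s_next)
        forward_step(z_t, a_t, z_next)   # forward step: predict z_{t+1}
        inverse_step(z_t, z_next, a_t)   # inverse step: predict a_t

# Dry run with counting stubs to show the alternating call order.
calls = []
vpcpm_pretrain(
    batches=[("s0", "a0", "s1"), ("s1", "a1", "s2")],
    encode=lambda s: f"z({s})",
    forward_step=lambda z, a, zn: calls.append("fwd"),
    inverse_step=lambda z, zn, a: calls.append("inv"),
)
```

In a real implementation, `forward_step` and `inverse_step` would each backpropagate through the encoder, so both constraints shape the same representation.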

Experiment Setup
The proposed method is evaluated on a diverse set of image-based control tasks from the DeepMind Control Suite (DMControl) [16], which has recently been considered a standard benchmark for the sample efficiency of RL from images [14,17,77,78]. DMControl consists of different robot models (environments), where each model can be associated with a particular MDP representing a specific task. The selected benchmark includes six environments from the PlaNet benchmark [77], as shown in Figure 3, where the action repeat is treated as a hyperparameter (Table 1). The settings for visual observations follow [14,17,18,36], which consider a stack of three consecutive 84 × 84 RGB renderings as a state.

Figure 3. VPCPM is benchmarked on six image-based control environments from the DeepMind Control Suite [16], ordered from the lowest to the highest action dimension: Cartpole, Ball in cup, Reacher, Finger, Cheetah, and Walker.

Each task offers a unique set of challenges, including complex dynamics, sparse rewards, and hard exploration. For the vision-based RL algorithm, we use two state-of-the-art methods, RAD [17] and DrQ [18], both based on soft actor-critic (SAC) [75]. The network architecture is identical to [17,18,36]. Unless stated otherwise, the algorithm configurations are as follows: the actor and critic networks are trained using the Adam optimizer [79] with a mini-batch size of 512. For SAC, the initial temperature is 0.1, the soft target update rate τ is 0.01, and the target network and actor updates are performed every two critic updates, similar to [17,18,36]. Random cropping [14,17] is used as image augmentation during pretraining. The learning rates for the actor, critic, and the temperature parameter α of RAD and DrQ follow the setup of each method.
In our VPCPM, the forward model is parameterized by four 50-d hidden layers with ReLU activations except on the last layer, and the inverse model is parameterized by three 1024-d hidden layers with ReLUs, except the last layer, which uses tanh to normalize the actions. The action is encoded by an MLP consisting of two 50-d hidden layers with ReLU except on the last. The input to the forward model MLP is the concatenation of the current latent state and the current encoded action. The input to the inverse model MLP is the concatenation of the current and next latent states. The forward and inverse models are trained with separate Adam optimizers [79]. During the pretraining stage, the encoder, inverse, and forward models are trained with a learning rate of 1 × 10−4 for Walker, 2 × 10−4 for Cheetah, and 1 × 10−3 otherwise, with a batch size of 512.
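The layer sizes described above can be sketched as plain numpy MLPs. This is a shape-checking illustration only: it assumes a 50-d latent space (as in the RAD/CURL encoders), a 6-d action (e.g., Walker), and one plausible reading of the hidden-layer counts; initialization details are our own:

```python
import numpy as np

def mlp_init(sizes, rng):
    """He-initialized (W, b) pairs for an MLP with the given layer widths."""
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_apply(params, x, final=None):
    """ReLU on hidden layers; optional `final` nonlinearity on the output."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
        elif final is not None:
            x = final(x)
    return x

rng = np.random.default_rng(0)
latent, act = 50, 6                                  # assumed dimensions
act_enc = mlp_init([act, 50, 50], rng)               # action encoder: two 50-d layers
fwd = mlp_init([latent + 50, 50, 50, 50, 50, latent], rng)  # four 50-d hidden layers
inv = mlp_init([2 * latent, 1024, 1024, 1024, act], rng)    # three 1024-d hidden layers

z_t = rng.standard_normal((8, latent))
a_bar = mlp_apply(act_enc, rng.standard_normal((8, act)))
z_next_pred = mlp_apply(fwd, np.concatenate([z_t, a_bar], axis=1))
a_pred = mlp_apply(inv, np.concatenate([z_t, z_next_pred], axis=1), final=np.tanh)
```

The tanh on the inverse model's output keeps predicted actions inside the normalized action range, matching the description above.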
The performance of the agent is evaluated across five seeds; for every seed, the average return over 10 episodes is computed every 10k environment steps. The figures plot the mean performance together with ±1 standard deviation shading. The performance is reported over true environment steps, as is common practice [14,17,18,36,78], and is thus invariant to the action repeat hyperparameter. Throughout the experiments, pretraining data are collected by a random policy. Specifically, 50k transitions are collected for the Cheetah and Walker domains, and 25k for the others. The encoder is pretrained for 50k iterations for Cheetah and Walker, and 25k otherwise, which corresponds to one update step per transition. The full set of parameters is shown in Table 1.

Effects of Pretrained Representation
In this section, the effectiveness of VPCPM in pretraining the visual encoder for different vision-based algorithms is investigated. Two state-of-the-art algorithms, RAD [17] and DrQ [18], are evaluated using the hyperparameters reported for each method. The random crop augmentation of each method is applied: in RAD, 84 × 84 frames are cropped from a 100 × 100 input frame, while in DrQ, the 84 × 84 frames are padded by 4 pixels on each side and then cropped back to 84 × 84. Moreover, in the RAD paper, some tasks use translation augmentation; we instead use crop augmentation across all tasks, so the results may vary. The parameters of the pretrained encoder are fine-tuned by the base RL algorithm on a specific task. Figure 4 and Table 2 compare these methods with and without pretraining. We provide results at both 100k and 500k steps, as is commonly reported for DMControl [14,17,18,46]. At 100k steps, VPCPM improves over RAD by 13% to 118%, with the largest gain on Ball in cup-catch. For DrQ, the improvements range from 6% to 20%, with the largest gain on Reacher-easy. From Figure 4, the improvement is clearest at the early stage of training on sparse-reward tasks such as Ball in cup-catch and Finger-spin. The reason is that agents on these tasks usually fail to complete the task at the beginning and thus observe fewer reward signals from which to learn the visual representation. With our VPCPM-initialized representations, the policy part can be learned quickly, significantly accelerating learning progress. For the Cartpole-swingup task, the action space is very small (one-dimensional); thus, with well-learned state representations, the task can be solved quickly.
Tasks such as Reacher-easy, Walker-walk, and Cheetah-run are more challenging because of the exploration problem and thus require more samples to complete even with good representations, resulting in smaller improvements. Overall, our method improves by 44% over RAD and 10% over DrQ at 100k steps. At 500k steps, the improvement is smaller: 3.4% over RAD and 2.9% over DrQ. This is because the base algorithms have almost converged by 500k steps, so the effect of pretraining is moderate.

Figure 4. Performance on six tasks from the PlaNet benchmark [77]. Pretraining the visual encoder with VPCPM consistently improves performance and sample efficiency across all environments.

Comparison with Prior Methods
In this section, a comparison of pretraining with different unsupervised learning methods is conducted. VPCPM is compared against two representation learning approaches: non-model-based and model-based. The non-model-based approach includes (i) the reconstruction loss as in VAE [35] and (ii) the contrastive loss on single observations as in CURL [14]. The model-based approach includes (iii) a simple predictive model (PM), where the forward and inverse models are learned with a mean squared error loss; (iv) augmented temporal contrast (ATC) [15], where a marginal forward model, i.e., without conditioning on actions, is parameterized by a residual network and learned with a contrastive loss; and (v) predictive coding-consistency-curvature (PC3) [80], where the forward model is learned with a weighted sum of three losses: a contrastive loss, a mean squared error loss, and a low-curvature loss. In PC3, the current latent state-action pair (z_t, a_t) is used as the source of negative samples for the contrastive prediction of the next latent state z_{t+1}. In contrast, we use the predicted next latent state ẑ_{t+1} as the source of negative samples. For ATC, we use our own implementation with the following modifications: the inverse model is disabled and the action input of the forward model is removed. For PC3, we use the authors' provided code (https://github.com/VinAIResearch/PC3-pytorch.git, accessed on 17 March 2022). For a fair comparison, the same amount of samples is used during pretraining. The evaluation procedure is similar to the previous section, but only the RAD algorithm is considered.
The performance in Figure 5 shows that RAD initialized by VPCPM outperforms all baselines across all environments. These improvements suggest the importance of imposing the dynamics on the visual encoder during pretraining, which is lacking in methods that focus only on semantic information, such as reconstruction and contrastive losses. In comparison to the simple PM, the proposed method benefits from the contrastive objective. Indeed, the inverse model alone is limited because it cannot capture changes in the sensory stream beyond the agent's control, and the contrastive objective helps prevent this degeneracy. Moreover, learning in a contrastive manner makes state representations more discriminative in the latent space. Compared with ATC, VPCPM shows the importance of the action-conditioned forward model together with the inverse model in learning controllable features. PC3 is originally designed for model-based planning algorithms such as iLQR, which require the system to be locally linear. Thus, the features learned by PC3 might not be suitable for vision-based RL algorithms in highly nonlinear systems, such as our considered environments. Indeed, the results show that the representations from VPCPM are more useful for vision-based RL algorithms. Overall, sample efficiency in deep RL should be attained from representations that are discriminative and follow the dynamics.

Effects of Components during Pretraining
Ablation tests were performed to determine the effects of the individual components of VPCPM. The performance of RAD with an encoder pretrained using the contrastive forward dynamics model (cFDM), the inverse dynamics model (IDM), and both together (VPCPM) is shown in Figure 6. Overall, the base RL agent benefits from pretraining with either type of dynamics model, but cFDM has a greater impact. Together with the constraint of IDM, the proposed method significantly improves the sample efficiency of the base algorithm. Moreover, training the visual encoder together with cFDM purely in the latent space does not suffer from the collapse problem, in which the encoder outputs a constant across states.

Generalization over Unseen Tasks
In this section, the generalization of the pretrained encoder to unseen tasks is examined. Specifically, the encoder pretrained on a source task is used for unseen target tasks. Subsequently, the RAD agent is trained on top of the pretrained encoder until convergence. The considered tasks are shown in Table 3. The target tasks differ in the reward function but share the same observation space, except for Reacher-hard, where the size of the visual indicator differs (see Figure 7). The performance of the base agent is evaluated in both "frozen" and "fine-tuning" settings, where the pretrained representation is either frozen or fine-tuned. The results are averaged across five seeds and compared against an agent learning from scratch.

Table 3. The encoder pretrained with data from a source task is reused for new target tasks. Almost all encoders generalize successfully to the target tasks, except for Reacher-hard, which requires fine-tuning.

Source Task → Target Task
Cartpole-swingup → Cartpole-swingup_sparse
Cartpole-balance → Cartpole-balance_sparse
Walker-walk → Walker-stand
Reacher-easy → Reacher-hard

The results are shown in Figure 8. In almost all tasks, the frozen representation is sufficient for learning an optimal policy. When fine-tuning from the pretrained initialization, the performance improves slightly. The major exception is the Reacher-hard task, where the frozen encoder significantly underperforms. However, the fine-tuned encoder is still more sample efficient than learning from scratch. This degraded performance is caused by the difference in observation space, i.e., the different size of the target indicator. The improvement in the performance of the base RL agent shows that VPCPM successfully learns abstract features without reward supervision.

Pretraining with Classification
To show the importance of imposing the dynamics in representation learning, we investigate the case where the visual encoder is trained to capture semantics only, without knowledge of the dynamics. To this end, we consider a six-way classification task corresponding to the six robot models from the DeepMind Control Suite [16], as shown in Figure 3. The dataset is generated by an expert policy. For each class, the training and test sets contain 50k and 10k samples, respectively; in total, there are 300k training samples and 60k test samples. The visual encoder and the classifier are trained using the Adam optimizer [79] with a learning rate of 3 × 10−4 and β = (0.9, 0.999). We use the data augmentation methods from [43] and random crop [17]. The pretrained encoder is then frozen and used for policy learning. The results are shown in Figure 9. On Ball in cup-catch and Cartpole-swingup, pretraining by classification slightly improves performance, while the other tasks show no gain or even degraded performance. These results indicate the importance of encoding dynamics information when learning representations offline for RL tasks.

Conclusions
In this paper, a new self-supervised representation learning method is proposed to pretrain the visual encoder for vision-based RL. By leveraging abundant reward-free data, the proposed method successfully learns meaningful initial representations that provide sufficient information and consistently follow the underlying dynamics of an environment. Experimental results show that state-of-the-art vision-based RL algorithms benefit from our method, with gains of 44% over RAD and 10% over DrQ at 100k steps. Additionally, we benchmark several leading self-supervised methods for pretraining visual encoders. The results show that the performance of a policy learned on top of the VPCPM-trained encoder matches or outperforms all others. Furthermore, the independence from task-specific rewards during pretraining allows our learned representations to be reused for different tasks sharing a similar observation and action space.
In this paper, we investigated the effectiveness of pretraining the visual encoder in the setting where the training and testing environments are similar. However, this assumption is brittle in practice. Future work should improve the robustness of the pretrained representations such that they are invariant to visual distractions in the environment, such as variations in background, color, and camera pose.