A Reinforcement Learning Algorithm for Automated Detection of Skin Lesions

Abstract: Skin cancers are increasing at an alarming rate, and detection in the early stages is essential for effective treatment. Current segmentation methods have limited ability to match the ground truth images because of the numerous noisy expert annotations present in the datasets. Precise boundary segmentation is essential to correctly locate and diagnose the various skin lesions. In this work, the lesion segmentation method is formulated as a Markov decision process. It is solved by training an agent to segment the region using a deep reinforcement-learning algorithm. Our method is similar to the delineation of a region of interest by physicians. The agent follows a set of serial actions for the region delineation, and the action space is defined as a set of continuous action parameters. The segmentation model learns in continuous action space using the deep deterministic policy gradient algorithm. The proposed method enables continuous improvement in performance as we proceed from coarse segmentation results to finer results. Finally, our proposed model is evaluated on the International Skin Imaging Collaboration (ISIC) 2017 image dataset, Human against Machine (HAM10000), and the PH 2 dataset. On the ISIC 2017 dataset, the algorithm achieves an accuracy of 96.33% for the naevus cases, 95.39% for the melanoma cases, and 94.27% for the seborrheic keratosis cases. The other metrics evaluated on these datasets also rank higher when compared with the current state-of-the-art lesion segmentation algorithms.


Introduction
The largest organ in the human body is the skin. The disorganized and uncontrolled growth of skin cells leads to skin cancer formation, and cancer can rapidly spread to other body parts. Skin cancer is a common kind of cancer worldwide. The deadliest form of skin cancer is melanoma, and its prevalence has been rapidly rising in the last 30 years [1]. Early diagnosis can increase one's chances of survival. Identification of melanoma or suspected skin lesions is conducted by dermoscopy imaging, which detects pigmented skin lesions. The technique is non-invasive and detects possible lesions in the early stage. Because of the higher resolution of dermoscopic images and better visualization capabilities, dermatologists can examine skin lesions with their own eyes. This decision-making process is time-consuming, requires a high degree of expert knowledge, and is subjective (i.e., dependent on the dermatologist's viewpoint). Convolutional neural networks (CNN) can detect melanoma in the same manner that dermatologists can [2], suggesting the potential for automated skin lesion analysis.
Automated skin lesion analysis is an essential part of computer-assisted diagnosis [3,4]. Existing artificial intelligence (AI) algorithms do not adequately consider this clinical frame of reference. The diagnostic accuracy (Acc) can increase if the clinicians fine-tune the coarse segmentation procedure in the form of a multi-step process. This segmentation approach is similar to interactive segmentation [22]. Since it involves painting strokes with a brush on the foreground and background or drawing a box around the foreground, the interaction is considered prior knowledge for segmentation. This strategy gathers past information and improves the segmentation algorithm's efficacy by interacting with the user. Inspired by these concepts, our method is formulated as a multi-step segmentation process. It has the advantage of improving segmentation performance by automatically gathering the previous knowledge, thus eliminating the need for human involvement.
The segmentation process is formulated as an MDP. The segmentation mask predicted in the previous step is taken as prior knowledge in the following step of the segmentation process. At every step, the agent executes an action that depends on the current segmentation mask and the input image. Our method derives its inspiration from the stroke-based stylization approach [23]. We propose a segmentation executor that draws a brushstroke on the input segmentation mask to designate the ROI. The segmentation executor uses a neural network that converts the continuous action parameters into brushstrokes. After a specific number of phases, the final segmentation result is achieved. The location and shape used to obtain the fine-grained segmentation are established by the set of continuous action parameters. Deep Q-Networks (DQN) [24] is one of the most frequently utilized DRL algorithms. However, it is restricted to solving discrete action space problems. We use the Deep Deterministic Policy Gradient (DDPG) [25] algorithm to handle the continuous action space. DDPG is a learning method that simultaneously learns a policy and a Q-function: it uses off-policy data and the Bellman equation to learn the Q-function, which is then employed to learn the policy. To the best of our knowledge, this is the first attempt to represent and solve the skin lesion image segmentation problem as an MDP using DDPG.
Since DDPG is heavily dependent on finding the correct hyperparameters for the task at hand, we adopt the action bundle, a suitable hyperparameter for the algorithm, thus increasing its stability.
The following are the main contributions of this work:
• The skin lesion image segmentation is formulated as an MDP. It is solved with the DDPG algorithm, similar to how physicians delineate the lesion image ROIs.

• The proposed skin image segmentation executor is based on the quadratic Bezier curve (QBC) and uses the action bundle as a hyperparameter to further improve the Acc of the segmentation process.

• We use a modified experience replay memory (ERM) to train the segmentation agent efficiently. The ERM helps utilize previous experiences efficiently by learning from them multiple times.

• We perform a quantitative statistical analysis of our skin lesion segmentation results to show the reliability of our segmentation method and compare our results with the current state-of-the-art approaches.
The structure of the article is as follows: Section 2 describes the current state-of-the-art methods; Section 3 presents an overview of our proposed RL method and the details of the experimental setup; Section 4 presents the results and discussion of our method. Finally, we conclude the article in Section 5.

Related Work
Many strategies for skin lesion segmentation are established in the literature, including region-merging-based approaches [26], active contour models [27-29], and thresholding-based methods [17]. Many conventional methods [28,29] based on morphological processes and clustering algorithms have been proposed. Jafari et al. [30] split the skin lesion into foreground and background regions using K-means clustering. Similarly, Ali et al. [31] suggested segmenting skin lesions using fuzzy C-means (FCM). In another important class of techniques, the active contour models [27-29], the contour evolves regularly as it approaches the boundaries of the pigmented regions. After candidate regions are generated using threshold-based methods, the active contour models are guided by a multi-direction gradient vector flow (GVF) snake [29] or local histogram fitting energy [26] and can be used to refine the coarse segmentation.
On the other hand, the traditional methods often use complex pre- and post-processing steps and a slew of data-dependent intermediate stages. Consequently, the performance of a conventional method primarily depends on these phases, requiring the design step to be done carefully when working with a variety of datasets. These methods fail if the boundaries of the pigmented regions are unclear and the skin conditions are complex. Deep CNN models have excelled in several computer vision applications [32-34], including advanced skin lesion segmentation. In general, convolution and pooling operations are used in basic CNN models. Deeper neural networks can extract more semantic and abstract characteristics (e.g., components and shape) using the learned kernels.
The output feature maps of classification neural networks often shrink over time (by subsampling). Consequently, a probability vector with values ranging from 0 to 1 and a dimension equal to the number of categories is generated. This is an encoding method in which increasingly abstract and semantic properties encode the images as the neural network grows in depth. A segmentation neural network has a fundamental structure similar to a neural network classifier, but it also has a decoding path that attempts to restore the output resolution (through upsampling), such that the output segmentation mask size matches the input image size. Based on the above, Jafari et al. [35] formulated segmentation as a classification problem for skin lesion analysis: image patches of various sizes centered on a single pixel serve as inputs, and the output for that pixel is the predicted label. In this scenario, locally present pixel context information is taken into account. Since this method relies on pixel-level prediction, dense prediction is needed, and the research subsequently moved towards CNNs with a decoding pathway to perform lesion segmentation. Ronneberger et al. [36] created the popular U-Net, which, due to its success, is extensively utilized in medical image segmentation applications.
Several U-Net-based melanoma segmentation and classification methods have been proposed [37-39]. Liu et al. [11] added a dilated convolution after every convolutional block of the original U-Net to extend the receptive field of the proposed technique. Abhishek et al. [40] enhanced performance by integrating and choosing different color bands depending on color changes. Yuan et al. [41] proposed a framework based on convolution-deconvolution, in which a Jaccard distance-based loss function was considered apart from the conventional cross-entropy loss. Al-Masni et al. [42] developed a full resolution convolutional network (FrCN) that learns full-resolution properties for each individual pixel of the input data without subsampling. Bi et al. [43] proposed training distinct CNN models for every known class using the category information; a stepwise integration (PSI) model based on hierarchical development was used to improve the output of lesion segmentation. Sarker et al. [44] proposed pyramid-pooling networks with dilated residual networks to segment skin lesions; combining an endpoint error loss with a negative log-likelihood loss results in sharp boundaries. Xie et al. [16] formulated skin lesion segmentation and classification as a mutual bootstrapping CNN method, in which one task bootstraps the other.
Long et al. [45] first suggested a fully convolutional network (FCN) with a skip architecture based on a standard classification network to segment an entire image swiftly. Karthik et al. [46] used Leaky ReLU and an FCN framework in the final model layers to segment ischemic lesions. Milletari et al. [47] proposed a V-Net-based architecture to segment medical images in 3D and 2D formats. Many interactive segmentation algorithms have been developed, with physicians assisting the whole segmentation task. Ronneberger et al. [36] proposed a CNN- and FCN-based architecture for biomedical image (cell) segmentation. The classification network is guided by the coarse mask predicted by a segmentation network trained explicitly for this purpose. Simultaneously, class-specific localization maps are produced by class activation mapping (CAM) and then concatenated into a U-Net-like network to improve the coarse mask prediction. DEXTR (deep extreme cut) [34] showed that using extreme points (the corner points of contours) as CNN input may improve instance segmentation results on natural images. On the other hand, reference [34] relies on extreme-point inputs, whose quality determines the segmentation's efficiency. According to research [48,49], the auxiliary task of boundary/edge prediction helps in instance segmentation.
A loss function based on the Dice index has been proposed to enhance segmentation networks. In addition to FCN-based techniques, several DL-based image segmentation algorithms have been proposed, including Polygon-RNN [50], DeepLab V3+ [51], and multi-task network cascades [52]. In recent years, novel approaches have been developed for various applications, such as area extraction [53], wound intensity correction [54], and automated lung nodule categorization [55]. Although the methods discussed above have positive effects, only a few works have examined how physicians compute the ROI in skin imaging. RL helps in imitating the demarcation technique of a physician. RL is progressing significantly in many applications by combining RL with DL; DQN [24], DDPG [25], proximal policy optimization (PPO) [56], and the asynchronous advantage actor-critic (A3C) are examples of deep neural networks used in DRL techniques for agent training. DeepMind achieved human-level game-playing abilities using DRL [23]. Therefore, other researchers have begun to use DRL for a range of problems, such as recommendation systems [57], game simulators, the Internet of Things, and adaptive packet scheduling [58]. DRL methods show promise in image classification, landmark identification [59], object localization [60], visual navigation [61], semantic parsing of large-scale 3D point clouds [62], and face recognition [63].
Sahba et al. [64] developed a system for the segmentation of prostate images based on RL Q-learning [65], which helps find the best values for a subcategory of images and enhances the extraction of the ROI from the image. However, Q-learning is limited to a narrow set of states and actions. Several researchers have tried to use DQN, which combines Q-learning with a CNN, for image segmentation in recent years. DeepOutline [66] is an end-to-end deep RL framework for semantic image segmentation that works similarly to a user sketching the outlines of objects in an image with a pen; this approach is also formulated as an MDP. SeedNet is a game-changing seed generation method for interactive segmentation [67]. In each of these methods, DQN is utilized to train an image segmentation agent. DQN, however, cannot handle continuous actions, demanding additional operations to address the issue. In this article, we use the DDPG algorithm directly to segment lesions, saving time and effort.

Proposed Method
This section addresses the publicly available skin lesion datasets, the preparation of the ground truth images, and our proposed RL method. The ISIC-2017 Skin Lesion Challenge dataset [12], the PH 2 dataset [13], and Human against Machine (HAM10000) [68] are the three public datasets used by our method. In addition, we scaled all of the images to a size of 361 × 256 pixels to increase Acc and reduce computational costs.

ISIC-2017 Segmentation Dataset
The ISIC is a leading organization in terms of the availability of skin lesion image datasets. In addition, it provides expert annotations for the lesion images that can be used by several automated computer-aided diagnosis (CADx) applications. These applications use the datasets to detect melanoma and other cancers. The organization holds annual skin lesion competitions to inspire more researchers to develop CAD applications that identify lesions and to promote skin cancer awareness [27]. The ISIC 2017 skin lesion dataset includes 2750 images, with 2000 in the training set, 150 in the validation set, and 600 in the test set. The algorithms must attain high Sensitivity (Sen) and Specificity (Spe) values to ensure that the lesions are correctly segmented. Unfortunately, when the ISIC 2018 challenge [27] was held, the ground truth of its training dataset was not released. As a result, we focus our evaluation on the ISIC-2017 dataset.

PH 2 Dataset
The PH 2 dataset contains 200 images, 160 of which are naevi (both atypical and common naevi) and 40 melanomas [13]. The ground truth in this dataset offers the true and precise boundaries of the skin lesions. This dataset acts as an alternative test dataset for DL models trained on the ISIC-2017 segmentation training set. The ISIC challenge dataset contains several dermoscopic skin lesion images collected by various dermatoscopes and camera devices worldwide. Consequently, color normalization and illumination pre-processing must be done using a color constancy method. To process the datasets, we utilized the Shades of Gray algorithm [69]. Figure 3 shows the skin lesion segmentation results produced by our proposed neural network architecture.
Appl. Sci. 2021, 11, x FOR PEER REVIEW

Overview of Our RL Method
In this article, we propose a groundbreaking multi-stage segmentation strategy based on DRL to detect skin lesions. In each step, a segmentation agent is trained to find the best segmentation technique based on the previous step's evaluation results. The DDPG algorithm trains the segmentation agent to solve the MDP problem. DL and the deterministic policy gradient (DPG) are combined in the DDPG algorithm [70]. The actor uses the low-dimensional state space to make choices. The advantage of DDPG is that it can learn policies directly in a continuous action space.
Since DDPG is an off-policy algorithm, it uses a large replay buffer, enabling it to learn from a wide variety of uncorrelated transitions. The DPG is the expected gradient of the action-value function, which makes it appealing: because of its simple form, it can be estimated far more accurately than the traditional stochastic policy gradient. In high-dimensional action spaces, DPG algorithms drastically outperform their stochastic equivalents. DDPG is an off-policy, model-free algorithm for learning continuous actions. It uses DQN's ERM and slowly updated target networks and is built on DPG, which operates over continuous action spaces. Compared with traditional methods, such as the level set, Chan-Vese, and snakes, the proposed method does not require any technical expertise.
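As a minimal sketch of such a replay buffer (illustrative only; the paper's modified ERM additionally replays stored experiences multiple times, which is not shown here):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) transitions.
    Uniform random sampling decorrelates the mini-batches used for updates."""

    def __init__(self, capacity):
        # When full, the oldest transitions are evicted first.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

A buffer created with `ReplayMemory(capacity)` silently discards the oldest experience once `capacity` transitions have been stored, which keeps the memory footprint bounded during long training runs.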
The neural network optimizes the proposed method based on the segmentation results of the prior step. Thus, the network evolves techniques for segmenting skin lesion images without the need for specialized knowledge. DDPG addresses problems that require a continuous action space and is based on the AC architecture. It was suggested to solve two challenges: overcoming the delayed-reward RL issue for neural networks and creating a self-learning, neural-network-based framework that needs no supervision from the context. The AC paradigm is a mix of value-based and policy-based methods. The former learns the value function implicitly and derives the policy from an action-value function. Conversely, policy optimization refers to an explicitly defined model capacity, such as the policy gradient (PG) [71]. Our RL-based image segmentation algorithm is shown in Algorithm 1 below:

Algorithm 1. RL-based skin lesion segmentation with DDPG.
Randomly initialize the actor network µ(s|θ^µ) and the critic network Q(s, a|θ^Q) with weights θ^µ and θ^Q
Initialize the target networks µ' and Q' with weights θ^µ' ← θ^µ, θ^Q' ← θ^Q
Initialize the experience replay memory R
for episode e = 1, N do
    Initialize a random process N for action exploration
    Receive the initial observation state s_1
    for t = 1, T do
        Select the action parameter set a_t = µ(s_t|θ^µ) + N_t according to the current policy and the exploration noise
        Feed the action parameters (A_s0, A_st, A_st+1, ..., A_sT) into the segmentation executor
        Feed the updated segmentation mask S_mt+1 and the ground truth for the computation of the reward function r(t)
        Execute action a_t, observe the reward r_t and the new state s_t+1
        Store the transition (s_t, a_t, r_t, s_t+1) in R
        Sample a random mini-batch of transitions (s_i, a_i, r_i, s_i+1) from R
        Feed the ground truth and S_mt into the critic network
        Feed the reward r(t) and the long-term expected return Q to the evaluation network
        Evaluate the segmentation policy based on the reward r(t) and the long-term return Q
        Update the critic by minimizing the loss L = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))^2, where y_i = r_i + γ Q'(s_i+1, µ'(s_i+1|θ^µ')|θ^Q')
        Update the actor policy using the sampled policy gradient ∇_θµ J ≈ (1/N) Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=µ(s_i)} ∇_θµ µ(s|θ^µ)|_{s_i}
    end for
end for

The actor-critic is made up of two networks: a value network and a policy network. The former is called the critic, and the latter is called the actor. In the AC network, the actor's responsibility is to learn the policy, while the critic helps evaluate the decisions taken by the actor. The actor aspires to enhanced performance, while the critic aspires to be more precise and accurate. Iterative optimization is used in the training process, following the theory of adversarial networks, because of the interdependence and interaction between the actor and the critic [72]. Moreover, our solution lets an off-the-shelf segmentation executor perform the segmentation mechanism based on the continuous set of action parameters. The segmentation executor renders a brushstroke from a series of action parameters and draws it onto the input segmentation mask to improve the Acc of the segmentation process. The architecture of our proposed method is depicted in Figures 3 and 4.
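The data-collection inner loop of Algorithm 1 can be sketched as follows. This is a control-flow illustration only: `actor`, `executor`, and `reward_fn` are placeholder callables standing in for the paper's neural networks, and the Gaussian noise stands in for the exploration process N.

```python
import random

def run_episode(state, actor, executor, reward_fn, memory, num_steps, noise_scale=0.1):
    """One episode of the collection loop: pick a noisy action
    a_t = mu(s_t) + N_t, let the segmentation executor update the mask,
    compute the reward against the ground truth, and store the transition
    in the experience replay memory."""
    for t in range(num_steps):
        # a_t = mu(s_t | theta_mu) + N_t (exploration noise on each parameter)
        action = [a + random.gauss(0.0, noise_scale) for a in actor(state)]
        next_state = executor(state, action)   # brushstroke -> updated mask
        reward = reward_fn(next_state)         # comparison with ground truth
        memory.append((state, action, reward, next_state))
        state = next_state
    return state
```

After each episode, mini-batches sampled from `memory` would drive the critic and actor updates shown in Algorithm 1.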
When γ is 1, the immediate reward and the long-term return are equally important. The segmentation policy is denoted by π. Using the Bellman equation, the critic estimates the value of the decision taken by the agent based on the long-term return Q. The value of Q depends on the state and the action and is calculated using the Bellman equation. The critic's estimation increases the segmentation Acc. Instead of S_t and A_t, the critic is fed the ground truth G together with S_t. The modified value function V(S_t, G) is trained using Equation (3) below:

V(S_t, G) = R(S_t, A_t) + γ V(S_t+1, G)    (3)

Figure 4. Our RL framework for the segmentation process. The set of actions selected by the actor depends on the current segmentation mask S_mt and the input image. The parameters are passed to the segmentation executor, which then produces the updated segmentation mask S_mt+1. The new segmentation mask serves three distinct purposes. First, it is used for the calculation of the reward by comparison with the ground truth mask. Secondly, it is given as input to the critic together with the ground truth to calculate the long-term estimated return Q. Thirdly, it is used to update the previously used segmentation mask S_mt. The actor, critic, and segmentation executor are all built using neural networks. The current segmentation policy π is evaluated by the critic based on the long-term return Q and the reward R.
A segmentation executor based on neural networks is used in the proposed method to perform the segmentation action. Using an initial segmentation mask and an input image, the agent attempts to characterize an action series (A_s0, A_st, A_st+1, ..., A_sT) using the new segmentation strategy S for a task that requires segmentation (mapping of action A to state S). The role of synthesizing the texture of each stroke in this RL method is conceived as a sequential decision-making mechanism centered on the MDP, with a soft tuft brush acting as the RL agent. The probability of a good action is very high at any stage, i.e., each decision should increase the compatibility between future and previous decisions. The Acc with which the segmentation process is carried out directly affects the efficiency of the segmentation results. Segmentation methods are used to train the DDPG handler. The segmentation executor adopts the brushstroke chosen by the actor in step t and obtains the modified segmentation mask S_mt+1. These steps are repeated throughout the segmentation procedure.
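The brushstroke geometry is based on the quadratic Bezier curve mentioned in Section 1. A minimal sketch of evaluating such a curve (the function names and the sampling scheme are ours, not the paper's):

```python
def qbc_point(p0, p1, p2, t):
    """Evaluate a quadratic Bezier curve at parameter t in [0, 1]:
    B(t) = (1-t)^2 * P0 + 2(1-t)t * P1 + t^2 * P2."""
    x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * p1[0] + t ** 2 * p2[0]
    y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * p1[1] + t ** 2 * p2[1]
    return (x, y)

def qbc_stroke(p0, p1, p2, n=100):
    """Sample n points along the curve, e.g., to rasterize one brushstroke."""
    return [qbc_point(p0, p1, p2, i / (n - 1)) for i in range(n)]
```

In this picture, the continuous action parameters chosen by the actor would supply the control points P0, P1, P2 (plus thickness and opacity values) that define each stroke.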
At the end of the segmentation phase, we obtain the final segmentation mask. The residual design of our model is similar to ResNet-18 [32] and is used for both the policy (actor) and the value (critic) network. Meanwhile, batch normalization [73] is used by the policy network. Our method's strength stems from integrating normalization into the model design; normalization is executed for every mini-batch of the training process. Batch normalization helps the networks train faster while reducing the time spent on initialization. It also serves as a regularizer, obviating the need for dropout in some instances. Standardizing weights with the translated ReLU (TreLU) aids the training of the model. Convolutional layers and fully connected layers are used in the segmentation network. The subpixel [74] technique is used in the segmentation executor to increase the brushstroke resolution. A CoordConv layer is taken as the first layer by both the critic and the actor. The interaction framework of the actor, critic, and segmentation executor is shown in Figure 4, and their architectural details are described in Figure 5. In Section 3.4, the image segmentation method is formulated as an MDP. The details of the segmentation executor and the steps used to improve our segmentation results are discussed in Section 3.3. If the constraint Σ_i b_i,n(y) = 1 is additionally imposed for all y between the first and the last knot, then the scaling factor Σ_i b_i,n(y) becomes fixed. The resulting spline functions are called B-splines.
Higher-order B-splines are defined in the form of the recursive Equation (7) as follows:

b_i,n(y) = ((y − t_i)/(t_i+n − t_i)) b_i,n−1(y) + ((t_i+n+1 − y)/(t_i+n+1 − t_i+1)) b_i+1,n−1(y)    (7)

where b_i,0(y) = 1 if t_i ≤ y < t_i+1, and 0 otherwise. The action bundle strategy further improves the Acc and is inspired by the frameskip [57], an effective hyperparameter for several RL tasks. The frameskip determines the granularity at which agents observe the environment, and the selected action is repeated over the skipped frames. The ERM for DDPG training is proposed in Section 3.4, which helps us obtain improved segmentation results.
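The Cox-de Boor recursion of Equation (7) translates directly into code. A sketch (the knot vector and degree below are illustrative, not the paper's actual parameters):

```python
def bspline_basis(i, n, y, knots):
    """B-spline basis function b_{i,n}(y) via the recursion of Equation (7).
    knots is a non-decreasing sequence t_0, t_1, ..."""
    if n == 0:
        # Base case: indicator of the half-open knot interval [t_i, t_{i+1}).
        return 1.0 if knots[i] <= y < knots[i + 1] else 0.0
    left_den = knots[i + n] - knots[i]
    right_den = knots[i + n + 1] - knots[i + 1]
    # By convention, a term with a zero denominator (repeated knots) is 0.
    left = 0.0 if left_den == 0 else \
        (y - knots[i]) / left_den * bspline_basis(i, n - 1, y, knots)
    right = 0.0 if right_den == 0 else \
        (knots[i + n + 1] - y) / right_den * bspline_basis(i + 1, n - 1, y, knots)
    return left + right
```

The partition-of-unity constraint mentioned above can be checked numerically: for y between the first and last valid knots, the basis functions of a given degree sum to 1.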

MDP for the Segmentation of Skin Lesions
The segmentation agent is used to find the ROI in this algorithm, with the skin lesion segmentation mechanism modeled as an MDP. The state space S, the action space A, and the reward function are the three main components of the MDP. These three components are defined as follows. State: the state space includes all of the agent's observations of the environment; the agent's decisions are based on this information. The state in this work consists of the image I, the current segmentation mask S_mt, and the step index t, defined as S_t = (S_mt, I, t). S_mt is a segmentation mask with pixel values of 0 or 255: foreground pixels are 255 and background pixels are 0. The default initial segmentation mask is all 0. I is a representation of the lesion that requires segmentation. The step index t is used to differentiate between the various phases of the segmentation. Our multi-step segmentation process has a terminal state: the steps are performed until the maximum number of steps is reached, at which point the agent enters the terminal state and executes the final segmentation task.
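The state tuple S_t = (S_mt, I, t) can be transcribed directly; a sketch in which the field names and the helper `initial_state` are ours:

```python
from dataclasses import dataclass

@dataclass
class State:
    """S_t = (S_mt, I, t): current mask, input image, and step index.
    Mask pixels follow the paper's convention of being 0 or 255."""
    mask: list    # S_mt, an H x W nested list
    image: list   # I, the lesion image to segment
    step: int     # t, distinguishes the segmentation phases

def initial_state(image, h, w):
    """The default initial segmentation mask is all 0 (no foreground)."""
    return State(mask=[[0] * w for _ in range(h)], image=image, step=0)
```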
Action: the action space includes any operation that the segmentation executor can conduct. In a given state, the agent selects an action from the action space according to a policy π. The action adjusts the brushstroke direction and shape and is described by several continuous parameters.
Reward function: the reward function defines the state-to-reward mapping in the RL task. The agent's job is to maximize the sum of discounted future rewards R. The reward signifies the immediate payoff of taking an action in a state, which helps evaluate the effectiveness of the agent's decision. The segmentation mask changes at every step during training; consequently, the accuracy of the mask is measured at every step by comparison with the ground truth mask. The L2 mean squared error is used as the similarity metric, and R_l2 denotes this default L2 reward. If two images are identical, the L2 loss equals 0. To better reflect each step's effect, the reward function exploits the change in L2 between two adjacent steps. The reward function R_diff is shown in Equation (1) below, where S_m(t-1) denotes the previous segmentation mask and S_mt denotes the current mask:

R_diff(S_t, A_t) = L2(S_m(t-1), GT) - L2(S_mt, GT). (1)
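The L2-difference reward can be sketched directly from its definition. The function names below are illustrative; the mean-squared-error form of the L2 loss follows the text above.

```python
# Sketch of the L2-difference reward R_diff (Equation (1)): the reward is the
# decrease in L2 error between the previous mask and the current mask, both
# compared against the ground truth mask.

def l2_loss(mask, ground_truth):
    """Mean squared error between two flat masks of equal length."""
    return sum((m - g) ** 2 for m, g in zip(mask, ground_truth)) / len(mask)

def r_diff(prev_mask, cur_mask, ground_truth):
    """Positive when the current mask moved closer to the ground truth."""
    return l2_loss(prev_mask, ground_truth) - l2_loss(cur_mask, ground_truth)
```

A step that paints a correct foreground pixel yields a positive reward; undoing it yields the symmetric negative reward.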
The reward function sends a favorable signal when the L2 loss decreases, and vice versa. However, to improve the learning trend, it is essential to estimate the long-term return Q at each step. While the reward function reflects the effectiveness of each individual action, the consistency of the chosen actions over the entire segmentation is equally essential. Q(S_t, A_t) is the value of taking action A_t in state S_t.
The Bellman equation is used to estimate Q(S_t, A_t) from the reward function R(S_t, A_t), as shown in Equation (2) below:

Q(S_t, A_t) = R(S_t, A_t) + γ Q(S_{t+1}, π(S_{t+1})), (2)

where Q(S_t, A_t) is the value of selecting A_t in state S_t, R(S_t, A_t) is the immediate reward, and γ is the discount factor that weighs the future return Q(S_{t+1}, π(S_{t+1})) against the immediate reward R(S_t, A_t).
When γ is zero, only the immediate reward is considered and all long-term returns are ignored.
When γ is 1, the immediate reward and the long-term returns are equally important. The segmentation policy is denoted by π. Using the Bellman equation, the critic estimates the long-term return Q of the decision taken by the agent, which depends on both the state and the action; this estimation increases the segmentation accuracy. Instead of (S_t, A_t), the critic is fed S_t together with the ground truth G. The modified value function V(S_t, G) is trained using Equation (3) below:

V(S_t, G) = R(S_t, A_t) + γ V(S_{t+1}, G). (3)
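The one-step backup shared by Equations (2) and (3) can be sketched as a single target computation. This is a generic sketch of the Bellman target, not the authors' training loop; a terminal step contributes only its immediate reward.

```python
# Sketch of the Bellman targets in Equations (2) and (3): the training target
# is the immediate reward plus the discounted estimate at the next state.

def bellman_target(reward, gamma, next_value, terminal):
    """One-step Bellman backup used to train Q (Eq. 2) or V(S, G) (Eq. 3)."""
    if terminal:
        return reward  # no future return after the episode ends
    return reward + gamma * next_value
```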

Finally, the DDPG algorithm is used to optimize the MDP for lesion segmentation. Section 3.5 provides details of the action bundle hyperparameter and the segmentation executor.

Action Bundle and the Segmentation Executor
The segmentation executor mentioned above is implemented as a neural ROI renderer that draws brushstrokes onto the mask. The segmentation executor has two advantages. First, it is differentiable and can be combined with DDPG. Second, a neural network can render with fine-grained quality. The segmentation executor is trained with supervised learning on a vast number of training samples collected from various graphical rendering systems. Several segmentation executors produce different brushstroke forms, including triangles, circles, quadratic Bézier curves (QBC), and B-spline curves. A B-spline curve is more reliable to employ than a Bézier curve, since its polynomial degree is independent of the number of control points, and it offers local control over each section of the curve through its control points; for a given parameter, the basis functions sum to one. Based on the experimental findings, the QBC and B-spline executors significantly benefit the segmentation of lesion images, so we use the QBC and B-spline to segment the lesion images. The QBC action parameters are described in Equation (4) as follows:

A_t = (x_0, y_0, x_1, y_1, x_2, y_2, r_0, r_1), (4)

where (x_0, y_0), (x_1, y_1), and (x_2, y_2) are the coordinates of the three QBC control points (P_0, P_1, P_2).
The parameters (r_0, r_1) determine the stroke thickness at the two QBC endpoints (P_0, P_2).
Eight action parameters are thus predicted by the neural network, producing strokes of distinct proportions and forms. Given points P_0, P_1, and P_2, the path traced by a quadratic Bézier curve is given as the function S(x) in Equation (5); it can be interpreted as the linear interpolation of the corresponding points on the linear Bézier curves from P_0 to P_1 and from P_1 to P_2, respectively:

S(x) = (1 - x)[(1 - x)P_0 + xP_1] + x[(1 - x)P_1 + xP_2], 0 ≤ x ≤ 1. (5)

Rearranging Equation (5) gives S(x) in Equation (6) as follows:

S(x) = (1 - x)^2 P_0 + 2(1 - x)x P_1 + x^2 P_2, 0 ≤ x ≤ 1. (6)

The tangents to the QBC at P_0 and P_2 intersect at P_1. The curve starts at P_0 in the direction of P_1 and, as the parameter ranges from 0 to 1, bends from the direction of P_1 to end at P_2. A spline of order n is a piecewise polynomial function of degree n - 1 in a variable y. The knots are the values of y where the polynomial pieces meet, listed in ascending order as {t_0, t_1, t_2, ..., t_n}. When the knots are distinct, the first n - 2 derivatives of the polynomial pieces are continuous across each knot. At a knot of multiplicity r, only the first n - r - 1 derivatives of the spline are continuous. For a given sequence of knots, there is, up to a scaling factor, a unique spline S_{i,n}(y) satisfying these conditions.
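Evaluating the quadratic Bézier form of Equation (6) is a one-liner; the sketch below (with an illustrative function name) shows the endpoint and midpoint behavior described above.

```python
# Sketch of evaluating a quadratic Bézier curve (Equation (6)) at parameter x:
# S(x) = (1 - x)^2 * P0 + 2(1 - x)x * P1 + x^2 * P2, for 0 <= x <= 1.

def qbc_point(p0, p1, p2, x):
    """Point on the QBC defined by control points p0, p1, p2 (2-D tuples)."""
    return tuple(
        (1 - x) ** 2 * a + 2 * (1 - x) * x * b + x ** 2 * c
        for a, b, c in zip(p0, p1, p2)
    )
```

At x = 0 the curve sits at P_0, at x = 1 it reaches P_2, and P_1 pulls the curve toward it in between, as the tangent description above requires.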
If the constraint ∑_i S_{i,n}(y) = 1 is additionally imposed for all y between the first and the last knot, then the scaling factor of each S_{i,n}(y) becomes fixed. The resulting spline functions are called B-splines. Higher-order B-splines are defined by the recursion in Equation (7) as follows:

S_{i,1}(y) = 1 if t_i ≤ y < t_{i+1}, and 0 otherwise;
S_{i,n}(y) = ((y - t_i)/(t_{i+n-1} - t_i)) S_{i,n-1}(y) + ((t_{i+n} - y)/(t_{i+n} - t_{i+1})) S_{i+1,n-1}(y). (7)
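The recursive definition of the B-spline basis can be sketched directly (a Cox-de Boor style recursion consistent with Equation (7); the zero-denominator guards for repeated knots are a standard convention, not stated in the text):

```python
# Sketch of the B-spline basis recursion of Equation (7).
# Order n means polynomial degree n - 1; `knots` is an ascending list t_0, t_1, ...

def bspline_basis(i, n, y, knots):
    if n == 1:  # order-1 basis: indicator of the knot span [t_i, t_{i+1})
        return 1.0 if knots[i] <= y < knots[i + 1] else 0.0
    left_den = knots[i + n - 1] - knots[i]
    right_den = knots[i + n] - knots[i + 1]
    # By convention, terms with a zero denominator (repeated knots) are dropped.
    left = 0.0 if left_den == 0 else (y - knots[i]) / left_den * bspline_basis(i, n - 1, y, knots)
    right = 0.0 if right_den == 0 else (knots[i + n] - y) / right_den * bspline_basis(i + 1, n - 1, y, knots)
    return left + right
```

On uniform knots 0..5, the three order-3 (quadratic) bases covering the span [2, 3) sum to one, illustrating the partition-of-unity constraint that defines B-splines.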
The action bundle strategy further improves accuracy and is inspired by the frameskip [57], an effective hyperparameter for several RL tasks. The frameskip determines the granularity at which the agent observes the environment and selects actions: with a frameskip parameter K, the agent repeats a selected action for K frames, which saves computational resources while linking the associated states. We explore an analogous connection between different actions, referred to as the action bundle. To encourage the actor to delve further into the action space, the actor creates an action bundle by selecting K actions at once; the segmentation executor then conducts these K operations in a single step, improving the segmentation accuracy. In Section 3.6, we discuss the ERM for DDPG.
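The bundle mechanism amounts to applying K brushstroke actions inside one environment step. The sketch below uses an illustrative `executor` callable and action format; it is not the authors' renderer.

```python
# Sketch of the action bundle: the actor emits K actions at once, and the
# segmentation executor applies all K sequentially in a single environment step.

def apply_action_bundle(executor, mask, action_bundle):
    """Apply a bundle of K brushstroke actions to the mask, one after another."""
    for action in action_bundle:
        mask = executor(mask, action)
    return mask
```

With K = 5 (the setting used later in the experiments), one agent decision therefore performs five brushstrokes.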

Modified ERM for DDPG
The training instances of a DRL algorithm are referred to as transitions. Each transition has five parameters: the current state S, the action A selected in S, the instant reward R, the next state S', and Terminal, i.e., whether the episode has ended. The ERM stores transitions (S, A, R, S', Terminal), and random sampling decorrelates the transitions. While the ERM retains many samples of the agent's experience with the environment, a small batch of transitions is randomly sampled from memory to train the agent. The ERM is used to optimize the critic inputs for the best possible assessment, as seen in Figure 6.

Figure 6. Modified ERM. The ERM is used to adjust the critic's feedback for a correct assessment.
State S and action A are given as input to the critic network to estimate the long-term return Q. The critic's performance determines how well the actor learns the correct policy π, and thus the algorithm's efficiency. A new parameter, the ground truth (GT), is added to improve the critic's evaluation ability for the segmentation assignment; each transition thus becomes the tuple (S, A, R, S', GT, Terminal). Based on this new transition, the ground truth and S' are sent to the critic for evaluation. Otherwise, when the ROI resembles the tissue that surrounds it, the resulting boundary ambiguity prevents the segmentation agent from interpreting the whole scene. The modified ERM is seen in Figure 6.
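The modified replay memory can be sketched as a bounded buffer of six-field transitions with uniform random sampling. The capacity and field layout below follow the text; the class and seeding are our own illustrative choices.

```python
# Sketch of the modified ERM: each transition carries the extra ground-truth
# (GT) field, (S, A, R, S', GT, Terminal). Old transitions are evicted when the
# buffer is full, and small random batches decorrelate training samples.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["s", "a", "r", "s_next", "gt", "terminal"])

class ReplayMemory:
    def __init__(self, capacity=600, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, gt, terminal):
        self.buffer.append(Transition(s, a, r, s_next, gt, terminal))

    def sample(self, batch_size):
        return self.rng.sample(list(self.buffer), batch_size)
```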

Results and Discussion
The performance of the RL algorithm is evaluated in this section. We conduct a qualitative and quantitative analysis of our proposed RL solution before comparing our results to those of various state-of-the-art segmentation algorithms and methods.

Implementation Details
The type of hardware used has a significant effect on network performance. The proposed RL method uses an NVidia P100 (16 GB RAM @ 1.32 GHz) delivering 9.3 TFLOPS; the networks are trained and benchmarked on this configuration using the train and test datasets described above. The system runs Ubuntu 16.04 with Python version 3.8. The network uses the Adam optimizer with a learning rate of 1e-4 and a mini-batch size of 16. In its simplest form, the decision-maker acts, receives a reward from the environment, and the environment shifts its state; the decision-maker then observes the new state, acts again, earns a reward, and so on. The state transitions are probabilistic and are determined solely by the current state and the actor's behavior. The actor's reward is determined by the action taken and by the initial and current condition of the environment. The discount factor γ is set to 0.85, the experience replay memory size to 600, the action bundle to K = 5, and the step number to t = 3.
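The training settings stated above can be gathered in one place. This is a sketch of a configuration dictionary with illustrative key names, not the authors' actual code.

```python
# Training hyperparameters reported in the implementation details above.
CONFIG = {
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "batch_size": 16,
    "gamma": 0.85,            # discount factor
    "replay_capacity": 600,   # experience replay memory size
    "action_bundle_K": 5,     # actions per bundle
    "max_steps": 3,           # segmentation steps per episode
}
```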

Evaluation Metrics
To assess the performance of the proposed models, the basic statistical parameters used in other literature works have been studied. Sensitivity (Sen) measures the proportion of lesion pixels correctly identified and is calculated in Equation (9) as follows:

Sen = TP/(TP + FN). (9)

Similarly, specificity (Spe) determines the proportion of non-lesion pixels correctly assigned and is given in Equation (10) as follows:

Spe = TN/(TN + FP). (10)

The rate of correct pixel classification, referred to as accuracy (Acc), is determined in Equation (11) as follows:

Acc = (TP + TN)/(TN + TP + FN + FP). (11)

The spatial overlap between the assigned binary mask and the segmented image is defined by the Dice coefficient (Dice), measured in Equation (12) as follows:

Dice = 2TP/(2TP + FP + FN). (12)

The Jaccard index relates the binary labels to the pixel values analyzed for the input image and is determined in Equation (13) as follows:

Jaccard Index = TP/(TP + FN + FP). (13)

Here, true positives (TP) are correctly identified lesion pixels, false positives (FP) are non-lesion pixels incorrectly labeled as lesion, true negatives (TN) are correctly labeled non-lesion pixels, and false negatives (FN) are lesion pixels that were missed.
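The five pixel-wise metrics of Equations (9)-(13) can be sketched from a single confusion-matrix pass over flattened binary masks (1 = lesion, 0 = background); the function name is illustrative.

```python
# Sketch of the pixel-wise metrics in Equations (9)-(13), computed from binary
# prediction and ground-truth masks of equal length.

def metrics(pred, gt):
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    tn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 0)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    return {
        "sen": tp / (tp + fn),                   # Eq. (9)
        "spe": tn / (tn + fp),                   # Eq. (10)
        "acc": (tp + tn) / (tp + tn + fp + fn),  # Eq. (11)
        "dice": 2 * tp / (2 * tp + fp + fn),     # Eq. (12)
        "jaccard": tp / (tp + fn + fp),          # Eq. (13)
    }
```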
A pixel's distance to a surface is defined in Equation (14) as follows:

d(p, S) = min_{q ∈ S} ||p - q||. (14)

The Hausdorff distance (HD) between two surfaces M and N, HD(M, N) = max{max_{p∈M} d(p, N), max_{q∈N} d(q, M)}, measures the difference between the predicted segmentation result and the ground truth; a lower HD value means better segmentation performance. The relative volume difference (RVD) indicates whether the segmented ROI is larger or smaller than the ground truth: a positive RVD means the algorithm extracts a larger region, and a negative RVD a smaller one. The RVD is calculated in Equation (15) as follows:

RVD = (|V_seg| - |V_gt|)/|V_gt|. (15)
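The two boundary/volume metrics can be sketched as follows, modeling surfaces as finite point sets for illustration (the function names and the point-set representation are our assumptions).

```python
# Sketch of the Hausdorff distance (built on the pixel-to-surface distance of
# Equation (14)) and the relative volume difference (Equation (15)).
import math

def point_to_surface(p, surface):
    """d(p, S) = min over q in S of ||p - q|| (Eq. 14)."""
    return min(math.dist(p, q) for q in surface)

def hausdorff(m_pts, n_pts):
    """Symmetric Hausdorff distance between point sets M and N."""
    d_mn = max(point_to_surface(p, n_pts) for p in m_pts)
    d_nm = max(point_to_surface(q, m_pts) for q in n_pts)
    return max(d_mn, d_nm)

def rvd(vol_seg, vol_gt):
    """Relative volume difference (Eq. 15): positive means over-segmentation."""
    return (vol_seg - vol_gt) / vol_gt
```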

Evaluation and Comparison on the ISIC 2017 Dataset, HAM10000, and the PH2 Dataset
Training neural networks for the CADx diagnosis of pigmented skin lesions is challenging due to the small size and lack of diversity of dermatoscopic image datasets. The HAM10000 dataset (Human Against Machine with 10,000 training images) [68] is available to solve this problem. In this dataset, different modalities are used to collect dermatoscopic images from various populations. The dataset contains 10,015 dermatoscopic images that are utilized for training machine-learning algorithms. The cases include a representative collection of all the essential diagnostic categories in the realm of pigmented lesions, such as actinic keratoses and intraepithelial carcinoma/Bowen's disease (akiec), basal cell carcinoma (bcc), benign keratosis-like lesions (seborrheic keratoses, lichen planus-like keratoses, and solar lentigines), and dermatofibroma (df).
The segmentation masks are evaluated on this dataset, and, as can be seen in Figure 7, the masks are very precise and clear. Figure 8a shows the initial input image, with Figure 8b,c showing the segmentation of other methods that use a variety of loss functions, illustrating the influence of the TL loss function and GN on the input images. In these methods, attention gates (AG) and group normalization (GN) analyze the input images and help predict the boundaries of skin lesions. Figure 7. Evaluation results on the HAM10000 dataset. As can be seen in the second and the fourth rows, the results are precise and clear. Figure 8d shows the lesion ROI computed by our RL algorithm. The results infer that the lesions are segmented in an accurate and precise manner. Our method reuses the CNN feature map and shortens the training and testing times, which helps us to train our networks end to end efficiently. Our algorithm enables the identification of high-dimensional hierarchical images, and the detection approach is robust to changes in conditions such as illumination and color balance. Our algorithm outperforms the alternative methods, as seen in Figure 9, which shows the visual comparison evaluated on the ISIC 2017 image dataset (with a black background). Figure 9a shows the input images from the dataset; the provided binary mask (ground truth) is depicted in Figure 9b, whereas the prediction results of the AG U-Net + GN method are depicted in Figure 9c.
Our RL algorithm results can be seen in Figure 9d. As can be seen, our algorithm is capable of segmenting lesions with great accuracy. However, in some instances the segmentation mask shows the exact shape of the lesion, differing somewhat from the ground truth; since the ground truth images are annotated manually, such minor discrepancies can happen. Figure 10 shows the visual results derived from the ISIC 2018 image dataset (without the black background). Figure 10a shows the original image; Figure 10b the given binary mask (ground truth); Figure 10c the prediction results of the Att U-Net + GN method; and Figure 10d the results of our RL algorithm. As can be seen, our segmentation mask corners are sharp and clear, and the masks closely resemble the ground truth images. Finally, Figure 11h shows the results of our RL algorithm. As can be seen from the image, the mask results are precise: the background is the darker (black) area, while the foreground is the lighter (white) area.
This method is quick, which makes the running time easy to compute, and it is well balanced because there is a high degree of separation between the foreground and the background. The proposed model's number of parameters, storage requirements, and inference speed are compared to other state-of-the-art models. The GN used in previous work normalizes the mean and variance of channel groups; e.g., the AG U-Net model scans images in epochs and can easily connect to other algorithms. In BCDU, one training epoch takes 359 s, while the U-Net baseline takes 165 s. On the NVIDIA Quadro K1200 GPU, the proposed solution (AG U-Net + GN + TL) is faster than the U-Net baseline with a 256 × 256 input scale [75]. Figure 11d shows the segmentation results of the SE block on the specific U-Net; Figure 11e the segmentation results of the BCDU network (with 1 dense unit); Figure 11f the segmentation results of the U-Net network (with all 64 filters); Figure 11g the Att-U-Net + GN + TL segmentation results; and Figure 11h the results of our RL algorithm, which segments the lesions clearly. In some cases, the segmentation mask shows the exact lesion shape, deviating a bit from the ground truth; nevertheless, the segmentation masks are clear. Figures 9-12 demonstrate the proposed RL method's better segmentation performance compared with the other state-of-the-art methods. The performance metrics for the PH2 and the ISIC 2017 challenge datasets are visualized in Figure 12. In Figure 12a, the statistical measures of our method, such as Acc, Dice, Jaccard index, Sen, and Spe, are plotted for the PH2 and the ISIC 2017 skin segmentation datasets: the blue line denotes the metrics for the PH2 dataset, and the red line those for the ISIC 2017 dataset. In Figure 12b, the statistical measures Dice, JSI, and MCC, together with their overall values, are plotted for the three categories of skin lesions: naevus, melanoma, and seborrheic keratosis; the highlighted points indicate metric values such as Sen, Spe, and Acc for each category on both datasets. Figure 12c shows the action bundle effect.

The visual segmentation results and the qualitative analysis show our model's segmentation mask performing better than those of the other models. The statistical measures of our RL algorithm are compared to the results of standard algorithms such as U-Net [36], SegNet [38], FrCN [42], etc. The Acc and Dice values are 95.39% and 95.7%, confirming that our results are superior to the other state-of-the-art methods. From this, we infer that we can qualitatively and quantitatively segment skin lesions better than the other approaches. Table 1 shows the different settings of the action bundle for various values of K; we evaluate the effect of the value K on the Dice values and infer that our segmentation results improve when we use both the modified ERM and the action bundle. In Table 2, our approach achieves better Sen and Spe than all of the compared algorithms, with values of 98.59% and 97%; we aim to design these algorithms for high Sen and Spe. Table 2 compares the proposed model's segmentation efficiency to that of state-of-the-art methods on the PH2 dataset in terms of the Dice index, Jaccard index, Acc, Sen, and Spe. The Acc, Jaccard index, and Spe values are 0.96, 0.92, and 0.97. The higher Acc shows that the ratio of the correctly segmented area over the ground truth is higher than for the other methods, i.e., the percent overlap between the target mask and the predicted mask is higher. In Table 3, our findings are also compared to U-Net [36], SE U-Net [37], BCDU [52], DeepLab V3+ [51], and other methods. The SE U-Net approach works in a manner analogous to the proposed technique, as it models inter-class dependencies by adding several convolution layer parameters; however, it is unable to segment minor and complicated lesions. The RVD value is negative at -0.0451, which signifies our better performance compared to the other methods.
In Tables 4 and 5, we compare our RL algorithm on three different lesion types: seborrheic keratosis (SK), naevus, and melanoma, against methods such as FCN-AlexNet [10], FCN16s [15], FCN 8s [41], etc. In Table 4, the algorithm outperforms the other methods with overall values of 93.98% for the Dice index, 88.79% for JSI, and 90.21% for MCC. The higher Dice and Jaccard index values indicate the better similarity of our qualitative results to the ground truth images. Table 5 shows the overall values for the various lesion categories: 96.25% for Sen, 94.71% for Spe, and 95.33% for Acc. Thus, we conclude that our RL algorithm enhances the segmentation accuracy on the benchmark datasets, outperforming the state-of-the-art methods with increases of 7%, 8%, and 9% for the Dice index, specificity, and Jaccard index, respectively. The other statistical measures are also higher for our RL method. Furthermore, the one-initialized Q-learning algorithm reaches a goal state under a goal-reward representation in fewer than O(e·n) steps, where S denotes a finite state set, G ⊆ S is the non-empty set of goal states, A(s) is the finite set of actions executable in state s ∈ S, and e = ∑_{s∈S} |A(s)|. Our algorithm can solve the complex problem of identifying skin lesions, which is very difficult using conventional techniques. The technique achieves long-term results and corrects errors that occur during the training process; once our model has corrected an error, there is less chance that it occurs again.
When a training dataset is absent, the RL algorithm can learn from its own experience. It maintains a balance between exploration and exploitation: exploration tries new actions to discover more rewarding ways to reach the target, while exploitation applies the most optimal solutions found so far.

Conclusions
This paper proposes an effective multi-step approach for skin lesion segmentation based on a deep reinforcement-learning algorithm. The segmentation process is formulated as a Markov decision process and is solved by training an agent to segment the region of interest using a deep reinforcement-learning algorithm. The agent follows a set of serial actions for the region delineation, with each action defined as a set of continuous parameters. The segmentation accuracy is boosted further by using the enhanced replay memory and the action bundle hyperparameter. The outcomes of the experiments demonstrate that the proposed reinforcement learning method yields good results. In the future, this method can be applied to other medical image segmentation tasks and other forms of diagnostic imaging. The proposed approach can also detect small irregular-shaped objects or objects with no fixed geometry in the segmentation task. The statistical results indicate the better performance of our reinforcement-learning algorithm on the datasets, outperforming the state-of-the-art methods with increases of 7%, 8%, and 9% for the Dice index, specificity, and Jaccard index, respectively. The other statistical measures, such as accuracy and MCC, also rank higher than those of the other methods. Thus, the proposed reinforcement-learning model can learn with ease how to segment complex skin lesion images.