Edge-Driven Multi-Agent Reinforcement Learning: A Novel Approach to Ultrasound Breast Tumor Segmentation

A segmentation model for ultrasound (US) images of breast tumors, based on virtual agents trained by reinforcement learning (RL), is proposed. The agents, living in the edge map, avoid false boundaries, connect broken edge parts, and accurately delineate the contour of the tumor. The agents move like robots navigating an unknown environment with the goal of maximizing their rewards. An individual agent does not know the goal of the entire population; however, since the agents communicate, the model captures global information and fits the irregular boundaries of complicated objects. Combining RL with a neural network makes it possible to learn and select the local features automatically. In particular, the agents handle the edge leaks and artifacts typical of US images. The proposed model outperforms 13 state-of-the-art algorithms, including selected deep learning models and their modifications.


Introduction
Worldwide, breast cancer is the most common cancer in women [1]. If diagnosed early, the survival rates are significantly improved. In Asia, US examination is included in routine cancer screening due to the high prevalence of females with dense breast tissue [2]. Accurate segmentation assists in evaluating the type and stage of the tumor as well as its possible progression. However, US images are often hard to interpret. They require expertise and experience. Detection and segmentation of breast tumors is a difficult task even for skilled clinicians. They also present a challenge for computer-assisted diagnostic (CAD) systems. The US machine receives an echo of the emitted sound waves. The image of this echo (the US image) is characterized by speckle noise, irregular tumor shapes, uneven textures, poorly defined boundaries, poor contrast, and multiple acoustic shadows. The recent advances in convolutional neural networks (CNNs) and deep learning (DL) tools show much promise for the segmentation of US images. Nevertheless, the learned features are often not interpretable. The training requires a large amount of ground-truth data (usually thousands of images). Moreover, a DL model trained on a specific set often fails when a different US machine or a different protocol is used [3]. Zhang et al. [4] note that the "application of these DL models in clinically realistic environments can result in poor generalization and decreased accuracy, mainly due to the domain shift across different hospitals, scanner vendors, imaging protocols, and patient populations... the error rate of a DL was 5.5% on images from the same vendor, but increased to 46.6% for images from another vendor". The nature of CNNs is to capture the local contextual information. However, they often miss tumors or produce over- or under-segmented boundaries. One of the reasons is that the global structure of the object is not captured.
The proposed model is motivated by Braitenberg's autonomous vehicles [5] and Reynolds' bird flocks [6] proposed in the 1980s. Braitenberg's seminal work of 1984 used reactive robots, or vehicles, with relatively simple sensor-motor connections aimed to simulate neuro-mechanisms in biological brains. In this paper, Braitenberg's work is considered a metaphor for artificial life, suggesting that complex behavior may be the result of a relatively simple design. The ideas are modified here. The robots do not know their final goal. They follow the rewards defined by a certain reward system. Additionally, they can be corrected by the neural network (NN), which observes the population and knows their final destination. Therefore, the NN is a metaphor for a "Universal God", "Galactic Force", etc. The idea is supported by artificial life (ALife) [7], based on tracing agents [8] and fusion images. It has been shown that, under certain conditions, ALife outperforms the state-of-the-art. Furthermore, ALife requires smaller training sets. The drawback of [7] is the requirement of an additional set of elasticity images, which are not always available. Further, the above ALife is based on fixed rules and requires extensive training performed manually or by a genetic algorithm. Therefore, an extension of this model is offered. The new model does not require the elasticity images and achieves the same or better accuracy. It uses a relatively small training set and is capable of handling previously unseen data.
At this level, the basic ideas are to: (1) generate a preliminary binary mask and offset trajectories to guide the agents, (2) train the agents using the convolution network, and (3) include the RL to replace the requirement of the image fusion. The model combines multi-agent reinforcement learning, neighbor message passing, and deep learning to generate trajectories of the agents approximating the boundary of the lesion. Message passing is based on principal neighborhood aggregation [9]. Multi-agent deep reinforcement learning (MADRL) employs the Gestalt laws (GL) [10] of shape perception to construct the reward system.
Summarizing the new elements above, the novelty of the proposed model is as follows.
- A new hybrid of DL, ALife, and RL. The method combines the strengths of the DL neural networks with the efficiency and simplicity of ALife and with the agents' communication and learning by RL.
- A reward system based on the GL.
- An original classification of the images based on the properties of the edge maps, included in the training algorithm.
- Verification on previously unseen data.
We also consider it valuable that the proposed bio-inspired model does not use the standard genetic framework, which often slows down the optimization. The agents are identical; they do not evolve, do not mate, and do not have leaders. This makes it possible to achieve fast computational times, e.g., 5-20 s per 500 × 500 image (Section 7). The agents communicate and try to maximize their rewards. Hence, the model is able to generate complicated patterns, fit irregular structures, and segment the image efficiently. The proposed MADRL algorithm has been tested on two US datasets against 13 state-of-the-art algorithms. They include active contours (AC), edge linking, level set methods (LS), superpixels, machine learning, deep learning algorithms, and their modifications. The numerical results reveal that the MADRL outperforms its competitors in terms of accuracy when applied to high-complexity images. Note that the paper performs cross-field testing, i.e., compares segmentation methods based on different principles. The most popular and original algorithms from six classes have been selected. First, consider the deformable models, i.e., AC and LS. They have been used massively for edge-based segmentation and extensively for segmentation of US images of the breast and thyroid. The latest survey [11] considers the deformable models the most frequently used approach for segmentation of US images. The adaptive diffusion flow [12] is one of the most remarkable versions of the AC. It establishes a joint framework between the classical gradient vector flow and image restoration. The recent AC energy-based pressure approach [13] introduces an "energy balance" between the inner and the outer regions. Finally, the hybrid AC-LS method [14], based on the local variations of the grey level, shows excellent results on various synthetic and real medical images. According to a survey in [7], these three algorithms outperform 29 top models from this class (https://sites.google.com/view/edge-driven-marl-siit-biomed/additional-references, accessed on 28 June 2023). The pioneering LS work is Osher, 1988 [15]. The propagating fronts, with velocity depending on the front curvature, became revolutionary in the image processing of the 2000s. The distance-regularized edge-based DRLSE [16] is one of the most popular. Introducing the third dimension solves the problem of multiple ACs and their possible merging. We refer the reader to various modifications of the LS given in the excellent review [17]. The DRLSE, the saliency-driven region/edge-based LS [18], and a correntropy-based LS [19] have been shown to outperform eight top algorithms from this class (https://sites.google.com/view/edge-driven-marl-siit-biomed/additional-references, accessed on 28 June 2023).
Edge linking methods connect pieces of the object boundary in the edge map into a closed contour. The algorithm [20] is based on edge tracing and a Bayesian contour extrapolation. The ratio contour method [21] encodes the GL of proximity and continuity combined with a saliency measure based on the relative gap length and the average curvature. Some of these ideas have been used in our model. The above models outperform four edge-linking procedures (https://sites.google.com/view/edge-driven-marl-siit-biomed/additional-references, accessed on 28 June 2023). Note that the early variants apply the graph-based approach, where the graph represents candidate parts of the boundary, and the weights represent the affinity between them. The graph is partitioned to optimize a selected utility function. Further, edge linking evolved in different directions. Some prominent examples are connected components [22], pixel following [23], and the dot completion game [24].
A recent review [25] claims that the superpixel models are one of the most important tools for image segmentation. In particular, the simple linear iterative clustering (SLIC) proposed in [26] is one of the most efficient. We compare the RL with a recent version of SLIC [3], which integrates the NN and k-NN techniques. Our second superpixel reference is [27]. This model applies the superpixels to training, whereas the segmentation is performed by the LS. The prior knowledge regarding the possible shape of the object is included in the LS functional.
Machine learning (ML), including deep learning and neural networks, is one of the most promising fields of study aimed at processing and recognizing US images [28,29]. An ML algorithm adapts itself to generate the required features instead of selecting them manually. The most successful models are CNNs and DL networks [30,31].
We consider U-Net [32] one of the most prominent and influential models of this type. The U-Net is often treated as "the benchmarking DL model for medical image processing" [33]. A hybrid of the ML and the Chan-Vese algorithm [34] is proposed by [35]. It combines the k-NN method and the support vector machine. We also test against a semi-supervised DL generative adversarial network proposed by [36]. The DL applies the multiscale framework and combines the ground truth and the probability maps obtained from the un-annotated data. The model outperforms the ASDNet, ZengNet, and DANNet (https://sites.google.com/view/edge-driven-marl-siit-biomed/additional-references, accessed on 28 June 2023). We also test against the selective kernel [37]. Currently, the selective kernel is one of the most developed U-Net algorithms, which outperforms many recent modifications of the U-Net. It should be noted that the above is not a complete review of the state of the art of medical image segmentation. The above models and some of their modifications have been selected as the benchmark for testing. The selection criteria are publications in reputable journals, references, tests against similar and cross-field models, and availability of the code. Our numerical results show that the proposed DL reinforcement learning outperforms the 13 selected models from different classes (cross-field testing). The advantage over the second-best performing model on the images characterized by a high complexity is as large as a 97% success rate against 77%. The results are subject to the selected set of images and the proposed accuracy measures. The developed code is applicable to breast cancer diagnostics (automated breast ultrasound), US-guided biopsies, as well as projects related to automatic breast scanners. A demo video illustrating the algorithm is available at (https://drive.google.com/file/d/1kqW68mdQ1QmkasA1gnXNFRRaavhvpA-M/view?usp=drive_link, accessed on 30 August 2023).

Reinforcement Learning
Reinforcement learning (RL) is a strategic framework in which the agent learns to perform a specific task through a series of actions and rewards.

Concept of Reinforcement Learning
The concept of RL has its roots in psychology and biology. It was introduced in the early 20th century by animal learning theorists, such as Pavlov and Thorndike. Later, Sutton and Barto [38] used these ideas for ML and ALife. The multiple-agent RL models introduce cooperating agents to handle complex tasks [39]. As DL evolved, RL was combined with it to become deep reinforcement learning (DRL) [40], where the DL acts as the control center. Nowadays, the DRL is becoming increasingly popular to further explore "the rabbit hole of the interactive AI" [39,41].
Recent examples are the DRL in robotics [42,43], simulations [40], language processing [44], autonomous vehicles [45], computer vision [41,46], and medical image processing [47]. Some popular DRL variants are deep Q-networks [48], the deep deterministic policy gradient [49], trust region policy optimization [50], and proximal policy optimization (PPO) [51]. As opposed to supervised learning, DRL offers sequential decision-making [41], which relies on exploratory experiences. This is particularly suitable for image segmentation. The RL models are able to develop a strategic path toward an optimal collective solution beyond local temporal gains. In this sense, RL tries to mirror human learning mechanisms to adapt to variable, complex environments.
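For readers unfamiliar with the mechanics behind the DQN family cited above, the classic tabular Q-learning update can be sketched in a few lines. The toy chain environment, its reward, and all hyperparameters below are illustrative only and are not part of the proposed model.

```python
# Minimal tabular Q-learning sketch (the ancestor of the deep Q-network).
# Toy chain: 5 states, action 1 moves right, action 0 stays; reward on reaching
# the last state. All values here are hypothetical.
import random

random.seed(0)

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, eps = 0.1, 0.9, 0.5   # learning rate, discount, exploration rate

def step(s, a):
    """Toy environment transition: action 1 moves right; reward at the end."""
    s2 = min(s + a, n_states - 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for episode in range(300):
    s = 0
    for _ in range(20):
        if random.random() < eps:
            a = random.randrange(n_actions)                   # explore
        else:
            a = max(range(n_actions), key=lambda b: Q[s][b])  # exploit
        s2, r = step(s, a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
        if s == n_states - 1:
            break

print(round(Q[3][1], 2))   # value of the action leading straight to the reward
```

Deep RL replaces the table `Q` with a network, which is what makes the approach applicable to the high-dimensional observations used in image segmentation.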

Reinforcement Learning for Medical Image Segmentation
Segmentation in medical imaging is usually performed by supervised learning. This requires extensive annotated data, which is often lacking in the field. The RL partly circumvents these limitations since an RL model can be trained on smaller sets. During the last two decades, RL has not been particularly popular in medical image processing. The main application of this algorithm was Q-learning for obstacle avoidance in robotics [52]. However, the recent review [53] considers image processing one of the key applications of Q-learning. An early attempt to adapt Q-learning to detect the segmentation threshold [54] evaluates the action probability using a variant of the Boltzmann policy [55]. An experienced operator guides the agents via the graphical user interface. References [56,57] propose an online RL evaluation. The US image is divided into sub-areas, and the actions are defined as adjusting the thresholds and the structural elements. The agents are rewarded using the ground-truth data. An RL segmentation of computed tomography images is proposed in [58]. The author shows that the approach requires a smaller set of training data compared to other methods. The Q-matrix is used to fix the segmentation thresholds. The approach is fully automated and parallelized. However, the above models show a lack of adaptation and generalization. In particular, the algorithms often fail when applied to unseen data. To circumvent these problems, ref. [59] proposes an online RL, where the agents memorize their previous interactions. Segmentation is performed by instantiating initial points and using the user feedback to update the state-action space. The model has been applied to the segmentation of cardiac MRIs. A multi-stage DRL combined with the actor-critic algorithm has been proposed in [60]. The value and policy networks are variants of ResNet-18 [61]. The model has been applied to MRIs and retinal images.
A recent model [62] performs segmentation of MRI images for the diagnosis of cardio-diseases. The method combines double Q-learning with a deep neural network [63] trained to find the true edge. The DRL has also been proposed for parameter optimization in [64] and for initialization of the object mask in [65]. Image segmentation and classification are interconnected problems, often performed simultaneously. Examples of such models are [66][67][68]. The DRL for image registration is used by [69,70]. According to the recent study [47], the advantages of DRL over the conventional schemes are the reduction of training data, the reduced memory, and the ability to discover new features during the sequential search for the optimal solution.

Agents
The agents are trained by the RL using the edge map, the offsets, and the grayscale image. The successful trajectories connecting broken edges constitute a set of candidates for the boundary. The agents communicate and adapt their movement to maximize the reward. The boundary constitutes a small fraction of the edge map (about 1-5%). In order to reduce the search space, the model creates an attention mask. Offset trajectories are generated within the mask to guide the agents. The mask is generated using histogram equalization and a variant of the superpixel decomposition [3] combined with neutrosophic clustering [71]. Figure 1 illustrates the generation of the mask.
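A minimal sketch of the attention-mask stage is given below. It implements histogram equalization directly and replaces the superpixel decomposition [3] and the neutrosophic clustering [71] with a crude mean threshold, so it only illustrates the shape of the pipeline, not the paper's actual mask generator; the synthetic image is hypothetical.

```python
# Hedged sketch: histogram equalization followed by a crude two-way split
# standing in for the superpixel + neutrosophic clustering steps of the paper.
import numpy as np

def equalize_hist(img):
    """Histogram equalization for a uint8 grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    return (cdf[img] * 255).astype(np.uint8)

def attention_mask(img):
    """Binary mask of the darker (hypoechoic, tumor-like) region."""
    eq = equalize_hist(img)
    return eq < eq.mean()   # stand-in for the clustering step

rng = np.random.default_rng(0)
img = rng.integers(100, 200, size=(64, 64)).astype(np.uint8)
img[20:40, 20:40] = rng.integers(0, 60, size=(20, 20))  # dark synthetic "tumor"

mask = attention_mask(img)
print(mask[20:40, 20:40].mean() > 0.9)  # the dark block lies inside the mask
```

In the paper the mask only restricts the search space; the agents, not the mask, produce the final contour.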

Offsets
The offsets are generated by the trim-and-join algorithm [72] and are parameterized by the cubic splines.For practical purposes, we experimented with the internal and external masks to further optimize the approach.However, the generation of the internal mask has not been fully automated.
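The trim-and-join algorithm [72] is not reproduced here, but the notion of an offset trajectory can be illustrated: displace a closed contour along its outward normals. The circle, the discretization, and the finite-difference tangents below are simplifications of this sketch; the paper parameterizes the offsets with cubic splines.

```python
# Hedged sketch of an offset trajectory: a closed contour displaced by a fixed
# distance along its outward normals. A circle stands in for the mask boundary.
import numpy as np

def offset_contour(xs, ys, d):
    """Offset a closed counterclockwise contour by distance d along its outward normals."""
    # Wrap one point on each side so central differences close the loop.
    tx = np.gradient(np.concatenate([xs[-1:], xs, xs[:1]]))[1:-1]
    ty = np.gradient(np.concatenate([ys[-1:], ys, ys[:1]]))[1:-1]
    norm = np.hypot(tx, ty)
    nx, ny = ty / norm, -tx / norm   # outward normal of a CCW contour
    return xs + d * nx, ys + d * ny

t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
xs, ys = 10 * np.cos(t), 10 * np.sin(t)    # circle of radius 10
ox, oy = offset_contour(xs, ys, 2.0)

radii = np.hypot(ox, oy)
print(radii.round(2).min(), radii.round(2).max())  # both ≈ 12.0
```

A spline parameterization of the resulting points would then give the agents a smooth curve to follow in the CW and CCW directions.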

Message Aggregated Deep Reinforcement Learning
The proposed message-aggregated/multiple-agent DRL algorithm is based on the Gestalt laws of shape perception. It should be noted that the GL have been included in many image processing and pattern recognition algorithms. We refer the reader to the excellent review [73] and the recent publications [74,75]. The proposed MADRL includes:

• The convolution encoder (feature extraction);
• The message passing and feature aggregation unit;
• The deep neural network (DNN) integrated with the DRL.
The DNN generates the stochastic policy π. Based on the observation o(t), the action of the agent is a step in a certain direction (the velocity vector v_i(t)). The reward function is a linear combination of multiple rewards (Section 3.4). A gradient-based RL is used, where the parameterized policies are optimized according to their expected returns.

Observation Space
The agents follow the offsets in either the clockwise (CW) or counterclockwise (CCW) direction. During their lifetime, they observe the environment within a specified window w_i. Their actions are generated by the partially observable Markov decision process proposed in [76,77]. The model includes the state space, the action space, the transition function, the reward function, the observation space, and the observation probability distribution. The agents are identical. Agent i reads the observations o_i(t) and aggregates the features f_i(t) from the neighboring agents. Based on the above, the agent evaluates the rewards r_i(t) and selects the action a_i(t). The observation vector o_i(t) is characterized by:

• The state: the agent is tracing an edge (state = edge) or flies outside the edge (state = free).
• The offset velocity v_i^s(t). If the agent is free, v_i^s(t) is the tangent at the closest point on the offset trajectory.
• The average curvature of the trajectory over the interval [t − t_2, t],

s_{i,C}(t) = (1 / (s(t) − s(t − t_2))) ∫_{s(t−t_2)}^{s(t)} κ_i(s) ds, (1)

where κ_i(s) is the curvature and s is the parametric variable along the trajectory.

Further, the agent has a memory, which includes the last three frames M_i(t) = (M_i(t − 2), M_i(t − 1), M_i(t)). Hence, the observation space includes the observation vector, the aggregated feature vector, and the grid maps M_i. Information exchange allows the generation of complicated patterns to fit the unknown boundary [39,78]. The neighborhood aggregation [9] employs the mean, maximum, minimum, and standard deviation. The features undergo amplification and attenuation and are saved in the aggregated feature vector f_i(t).
The key elements of the system are shown in Figure 3. Finally, note that the feature aggregation captures the behavior of the agents and their interactions with the environment, whereas the convolution unit processes stationary pixel data.

Reward Function
The reward function based on the GL is considered domain knowledge [79]. The signals are dynamic. They control the perception of the boundary shape and the contour linking [80,81]. Recall that the GL include continuity, closure, and proximity. Therefore, the reward is the weighted sum

r_i(t) = Σ_a w_a r_i^a(t), a = c, cl, p, d,

where w_a are the corresponding weights for continuity, closure, proximity, and density. The first continuity reward r_i^{continuity,1} is based on the average curvature s_{i,C}(t) defined by Equation (1). Recall that the agents follow the offsets in the CW and CCW directions. When CW and CCW agents collide, they connect the corresponding edges. Hence, the second continuity reward r_i^{continuity,2} depends on the angle θ_{i,k} between the trajectories i and k at the collision point. Additionally, the agents cannot live long outside the edge: if t_free > t_live, the agent dies. The proximity reward favors small gaps between the connected edges. If the agent returns to a neighborhood of its departure point, the closure reward is incremented. Finally, the density reward controls the local density of the agents. The weights are obtained as part of the training (Sections 4 and 7).
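The combination of the rewards can be sketched as the weighted sum described above. The numeric weights and the reward values below are illustrative only (the trained weights are reported in Table 10), and the individual reward terms are treated as opaque inputs.

```python
# Hedged sketch of combining the Gestalt-law rewards into r_i(t).
# The exact formulas for each term are not reproduced here.
def total_reward(terms, weights):
    """r_i(t) = sum_a w_a * r_i^a(t), a in {continuity, closure, proximity, density}."""
    return sum(weights[a] * terms[a] for a in terms)

# Illustrative values only.
weights = {"continuity": 1.0, "closure": 0.5, "proximity": 0.8, "density": 0.3}
terms = {"continuity": 0.9, "closure": 1.0, "proximity": 0.2, "density": -0.5}

r = total_reward(terms, weights)
print(round(r, 2))  # 0.9 + 0.5 + 0.16 - 0.15 = 1.41
```

Because the sum is linear, the relative magnitudes of the trained weights directly indicate which Gestalt law dominates the agents' behavior.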

Network Architecture
The proposed network is shown in Figure 4. The grid maps, observations, and rewards are the input of the DNN.
The DNN consists of six hidden layers (three convolution and three dense layers). The grid maps M_i are used to extract the low-level features of the agent's environment. The resulting features are then combined into the feature vector. The first convolution layer applies 32 two-dimensional filters with a kernel size of three and a stride of one. A max-pooling layer with kernel two and stride two follows. The second and third convolution layers consist of 64 and 128 2D filters, respectively, with kernel three and stride one. They are followed by max-pooling layers with kernel two and stride two. The feature maps are flattened into a 256-dimensional vector. The flattening layer concatenates o_i(t), f_i(t), and r_i(t) and feeds the next two fully connected layers with 128 rectifier units. The final layer is a fully connected layer with a sigmoid activation [82].
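The spatial bookkeeping of the convolutional trunk can be checked with a few lines. The input grid-map size and the 'valid' (no-padding) convolutions are assumptions of this sketch, since the text does not specify them; a dense projection from the flattened features to the 256-dimensional vector is likewise assumed.

```python
# Shape walk-through of the convolutional trunk described above:
# three (conv k3 s1 -> max-pool k2 s2) stages with 32, 64, 128 filters.
def conv_out(n, k=3, s=1):
    return (n - k) // s + 1

def pool_out(n, k=2, s=2):
    return (n - k) // s + 1

n, c = 34, 1                    # hypothetical 34x34 single-channel grid map
for filters in (32, 64, 128):
    n = pool_out(conv_out(n))   # conv k3 s1, then max-pool k2 s2
    c = filters

# 2 x 2 x 128 = 512 values before flattening; a dense projection to 256 units
# is assumed to produce the 256-dimensional feature vector.
print(n, c)
```

Tracking the shapes this way makes it easy to verify that a chosen grid-map size survives three pooling stages without collapsing to zero.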

Network Training
The PPO [51] is used for training via an on-policy gradient algorithm [60] based on the DRL actor-critic techniques. The PPO produces smooth gradient updates to ensure a stable policy space and reduced hyperparameter tuning. The agent trajectories are standardized and used to construct the surrogate loss L_θ and the least-squares error (L_2 loss) L_φ. The loss function is minimized using the Adam optimizer [83]. The policy is updated over E_π epochs, and the value function is updated over E_φ epochs.
The clipped surrogate loss is

L_θ = Ê_t [ min( r_t(θ) Â_i(t), clip(r_t(θ), 1 − ε, 1 + ε) Â_i(t) ) ],

where Ê_t denotes a stochastic estimate of the expected value over a mini-batch of transitions, and the clipping parameter ε controls the range of the probability ratio r_t(θ) used for the update. Â_i(t) is a generalized advantage estimate [84] that smooths the discounted rewards and reduces the variance of the policy gradients to make the training stable. In addition, it indicates how good a particular state is. The estimate is defined by

Â_i(t) = Σ_{l=0}^{T_i − t} (γλ)^l δ_i(t + l), δ_i(t) = r_i(t) + γ V_π(s_i(t + 1)) − V_π(s_i(t)),

where λ is the smoothing factor, γ is the discount factor for the future rewards, T_i is the terminal time step of the episode, and i is the index of the agent. The term δ_i represents the advantage of the new state over the previous state. It takes into account the immediate reward r_i(t) and the expected future value V_π(s_i(t + 1)), discounted by a factor of γ.
The value function V_π(s_i) is used to update the policy function. The L_2 loss is given by

L_φ = (1/K) Σ_i Σ_t ( V_φ(s_i(t)) − V̂_i(t) )²,

where K is the total number of trajectories, V_φ(s) is the value function approximated by the network with parameters φ, and V̂_i(t) is the estimated discounted return,

V̂_i(t) = Σ_{l=0}^{T_i − t} γ^l r_i(t + l).

Note that the policy and value networks are characterized by the same architecture, but their parameters are not shared. The networks generate the optimal policy π_θ simultaneously.
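The advantage estimate and the clipped surrogate can be sketched in numpy. The toy rewards, values, and hyperparameter settings below are illustrative; a terminal value of zero is assumed at the end of the episode.

```python
# Hedged numpy sketch of generalized advantage estimation (GAE) [84] and the
# PPO clipped surrogate used in the training loop above.
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma*lam)^l * delta_{t+l}, delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0  # terminal V = 0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def ppo_clip_objective(ratio, adv, eps=0.2):
    """min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t), averaged over the batch."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()

rewards = np.array([0.0, 0.0, 1.0])   # toy 3-step episode, reward at the end
values  = np.array([0.2, 0.4, 0.7])   # V(s_t) for t = 0, 1, 2
adv = gae(rewards, values)
print(adv.round(3))
```

The clipping keeps each update inside a trust region around the old policy, which is what gives PPO its "smooth gradient updates" mentioned above.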
The pseudocode of the training algorithm is presented as Algorithm 1.

Training Data
The network has been pre-trained using a transfer learning scheme [85]. The MADRL has been tested on 1000 US images from http://www.onlinemedicalimages.com (accessed on 17 June 2023) of Thammasat University Hospital, Bangkok, Thailand (Philips iU22 US machine) and from https://www.ultrasoundcases.info (accessed on 20 July 2023) of The Gelderse Vallei Hospital, Ede, The Netherlands (Hitachi US machine). Three experienced radiologists from Thammasat University Hospital provided the ground truth using an electronic pen on a Microsoft Surface tablet. The final ground truth is obtained by the majority rule. The image resolution ranges from 200 × 200 to 600 × 480, with a 60:40 ratio for training and evaluation. The training edge maps are categorized according to their complexity. The categories include the complexity of the shape, the length of the gaps relative to the contour, and the maximum gap on the contour (Section 5). The approach allows us to find a suitable solution in a less difficult environment fast and to progressively advance to the more complex environments. When stability and convergence are established in each category, the training is completed.

Experimental Results
We introduce the following quality measures.

Contour Based Metrics
The Hausdorff distance is defined by

dist_H1(X, Y) = max{ max_{x∈X} min_{y∈Y} ||x − y||, max_{y∈Y} min_{x∈X} ||x − y|| },

where X is the ground-truth contour and Y is the resulting contour.

The average Hausdorff distance is defined by

dist_H2(X, Y) = ( Σ_{x∈X} min_{y∈Y} ||x − y|| + Σ_{y∈Y} min_{x∈X} ||x − y|| ) / (L_X + L_Y),

where L_X and L_Y are the lengths of the two contours X and Y, respectively. The relative Hausdorff distance is defined by

dist_H3(X, Y) = dist_H2(X, Y) / L_X × 100%.

Note that dist_H1 and dist_H2 evaluate the absolute difference between the contours, whereas dist_H3 is normalized by the length of the contour. Therefore, if the set includes objects of different sizes (which is the case), dist_H3 is preferable.
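The contour metrics can be sketched for contours given as point sets. Representing the contour lengths L_X and L_Y by the number of sampled points is an assumption of this sketch, and the tiny four-point "contours" are purely illustrative.

```python
# Numpy sketch of the contour-based metrics, with contours as (N, 2) point sets.
import numpy as np

def pairwise_d(X, Y):
    """All pairwise Euclidean distances between the points of X and Y."""
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

def hausdorff(X, Y):
    """Symmetric Hausdorff distance: worst-case nearest-point distance."""
    D = pairwise_d(X, Y)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def avg_hausdorff(X, Y):
    """Average of the nearest-point distances in both directions."""
    D = pairwise_d(X, Y)
    return (D.min(axis=1).sum() + D.min(axis=0).sum()) / (len(X) + len(Y))

# Ground truth: unit-square corners; result: the same square shifted by 0.1 in x.
X = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
Y = X + np.array([0.1, 0.0])
print(round(hausdorff(X, Y), 3), round(avg_hausdorff(X, Y), 3))  # 0.1 0.1
```

For densely sampled contours the point count is proportional to the contour length, so the point-set version approximates the length-normalized definitions above.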

Area-Based Metrics
Recall: TP/(TP + FN).
Precision: TP/(TP + FP).
Accuracy: Acc = (TP + TN)/(TP + TN + FP + FN).
Jaccard index: TP/(TP + FP + FN).
Dice index: 2TP/(2TP + FP + FN).
Here, TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives. Note that Acc grows with TN, which usually dominates the image. That is why it is often large even when the segmentation quality is poor. The Dice and the Jaccard indices are often considered the most reliable area-based measures in medical image processing [86].
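The area-based metrics can be computed directly from binary masks, which also illustrates the remark about Acc: on a synthetic example below (hypothetical masks), accuracy stays high while the overlap metrics expose the under-segmentation.

```python
# Numpy sketch of the area-based metrics from binary masks.
import numpy as np

def area_metrics(pred, gt):
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "jaccard": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
    }

gt = np.zeros((100, 100), bool); gt[20:60, 20:60] = True      # 1600-px "tumor"
pred = np.zeros((100, 100), bool); pred[25:60, 20:60] = True  # misses 5 rows

m = area_metrics(pred, gt)
print(round(m["dice"], 3), round(m["accuracy"], 3))  # 0.933 0.98
```

Here the large TN background keeps Acc at 0.98 even though 200 tumor pixels are missed, which is exactly why the Dice and Jaccard indices are preferred.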

Successful Segmentation
It is often the case that a model produces a few outliers with low accuracy. In this case, the average accuracy does not clearly represent the performance of the method. Therefore, we collect the outliers into a separate group and introduce the accurate segmentation ratio SG_ratio, defined as the percentage of the successful results relative to the entire number of trials. The segmentation is considered successful if dist_H3 ≤ 2%. Only the successful cases are used to calculate the evaluation metrics. In order to show the actual accuracy, the results include the SG_ratio.

Image Categorization
Conventional categories of the US images (Figure 5) are usually not suitable for edge-map-based models. Therefore, in addition to the above, we categorize the edge maps. The proposed categories allow us to train and assess the performance of the algorithm and to identify the directions of improvement. Note that characterizing the complexity of the US images by the signal-to-noise ratio (SNR), the contrast-to-noise ratio (CNR), or similar measures is often pointless. Many US artifacts are not noise. They represent a real ultrasonic echo of actual human tissue rather than noise produced by the equipment or the environment.

Image Complexity
We define (1) the ratio L_g of the total length of the gaps to the length of the object boundary, and (2) the ratio L_g,max of the maximum gap to the length of the boundary. The complexity C relative to L_g is defined using the standard deviation as follows: if L_g < L̄_g + σ_g, then the complexity is C = B (baseline); otherwise, C = T (tough). Here, L̄_g represents the mean and σ_g the standard deviation of L_g over the training set. The complexity C_max relative to the maximum boundary gap L_g,max is defined similarly. The complexity of the shape S is defined by the difference between the area of the tumor and the area of the smallest enclosing circle. Based on the above definitions, the complexity of a US image is encoded by S|C_max|C, where each of S, C_max, and C is either B (easy) or T (difficult). For example, B|B|B stands for the simplest images, while T|T|T stands for the most complex cases (Figure 6).
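The S|C_max|C code can be sketched as follows. Only the mean-plus-standard-deviation threshold is taken from the definition above; the per-image gap and shape statistics below are synthetic.

```python
# Sketch of the S|C_max|C complexity encoding with synthetic statistics.
import statistics

def label(value, sample):
    """'B' (baseline) if value < mean + std of the training sample, else 'T' (tough)."""
    thr = statistics.mean(sample) + statistics.stdev(sample)
    return "B" if value < thr else "T"

# Hypothetical per-image statistics over a training set.
gap_ratios   = [0.05, 0.10, 0.12, 0.30, 0.08]  # L_g: total gaps / boundary length
max_gaps     = [0.02, 0.05, 0.06, 0.20, 0.03]  # L_g,max: largest gap / boundary length
shape_scores = [0.1, 0.2, 0.15, 0.6, 0.12]     # area deficit vs. smallest enclosing circle

def complexity_code(shape, max_gap, gap_ratio):
    return f"{label(shape, shape_scores)}|{label(max_gap, max_gaps)}|{label(gap_ratio, gap_ratios)}"

print(complexity_code(0.1, 0.02, 0.05))  # B|B|B — a simple image
print(complexity_code(0.7, 0.25, 0.35))  # T|T|T — a hard image
```

Grouping the training images by this code is what allows the curriculum described above: the agents first converge in the B-dominated categories before advancing to the T-dominated ones.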
Consider the impact of L_g and L_g,max on the efficiency of the US segmentation. Tables 3 and 4 show the results for B|T|B and B|B|T. Clearly, the maximum size of the gap L_g,max has a greater impact on the accuracy than the total gap length. In particular, for B|B|T, eight methods have SG_ratio = 100% and three methods fall below 90%. However, in the case of B|T|B, only six methods remain in the 100% category (including the proposed MADRL). Moreover, the number of successfully segmented images drops below 90% for five methods. The lowest Dice is Dice_BS-EL = 65.38 with dist_H3,BS-EL = 2.6%. The results for the complexity B|T|T are given in Table 5. It combines a large total gap length, a large maximum gap, and a simple shape S = B. Only SR-LS, PK-SP, SC-SP, SS-GAN, S-U-NET, and the proposed MADRL achieve a segmentation rate of 100%. Nonetheless, the RL shows only a marginal improvement over the S-U-NET. For instance, Dice_S-U-NET = 92.02 and dist_H3,S-U-NET = 0.74%, while Dice_MADRL = 93.05 and dist_H3,MADRL = 0.53%.
Tables 6 and 7 show the significance of the tumor shape. Even T|B|B presents a challenge. The occurrence of a leakage at the boundaries and the irregularity of the geometry of the tumor reduce the success rate. The exceptions are the S-U-NET and the proposed MADRL. Note that SR-LS, PK-SP, SS-GAN, and S-U-NET maintain an acceptable success rate above 90%. However, the proposed method outperforms them with SG_ratio = 100%, the best Dice_MADRL = 91, and the best dist_H3,MADRL = 0.61%.
The complexity T|T|B indicates an irregular shape and a considerable edge leakage (Table 8). This configuration finally breaks the S-U-NET, although its success rate remains above 90%. The ACs and LSs fail. SR-LS achieves a modest SG_ratio = 73%. The edge-linking methods BS-EL and CR-EL fail, with a success rate below 30%. However, MADRL stands out with SG_ratio,MADRL = 100%, Dice = 89.6, and dist_H3 = 0.84%. The most difficult type, T|T|T (Table 9), shows a low success rate for the majority of the competing methods. The edge-linking methods BS-EL and CR-EL have an unacceptable SG_ratio = 0. The complexity of the shape combined with the edge leaks leads to the failure of the AC and LS models. The minimum and maximum success rates of the ACs are 13% and 27%, respectively. The LS is slightly better: the minimal success rate is SG_ratio,DR-LS = 36% and the maximum is SG_ratio,SR-LS = 63%. One of the main disadvantages of the deformable shapes is a strong dependence on the initialization and the inability to handle strong edge leaks. The T|T|T images are irregular, having acoustic shadows and artifacts. As a result, the edge detector generates strong false boundaries, and the edge-based AC and LS produce a poor segmentation.
Further, the drawback of the superpixel methods is a potential loss of the true edge if it gets inside a generated superpixel. SC-SP and PK-SP show a success rate of less than 60%. Their Dice is below 80. Our main competitor, S-U-NET, has SG_ratio = 77%. We conjecture that this is due to the complexity of the shape, insufficient training data, and the inability of the DL methods to capture the global information. This comment applies to all ML and DL methods presented in the tables. Note that only the S-U-NET achieves a reasonable Dice = 79, which is slightly below the threshold of 80. Further, dist_H3,S-U-NET = 1.44% is also an acceptable result. However, MADRL has a significantly higher SG_ratio = 98%, Dice = 90, and dist_H3 = 0.83%.
One of the main reasons for the failure of the DL methods is that the algorithms do not understand the global context. As opposed to that, MADRL combines the DL, which allows for automatic feature extraction, with self-trained reinforcement agents capable of communicating the information throughout the entire population. Another reason is a lack of annotated data. Usually, a pure DL model requires tens of thousands of samples. However, the proposed model employs only about 2000 annotated images.

Scalability and Parameters of the Agent Locomotion
Scalability measures the decrease in the performance of the model in response to changes in the scale of the input data. The numerical experiments show that the computational time is proportional to the length of the boundary of the tumor. Figure 8 shows the average computational time for tumor boundary lengths ranging from 500 to 5000 pixels. Roughly, this corresponds to images ranging from 100 × 100 to 1000 × 1000 pixels. Fitting the data with interpolating polynomials shows that a linear function provides the best fit. The optimal values of the reward weights introduced in Section 3.4, obtained by training, are given in Table 10 below. The table shows that the continuity measured by the average curvature (continuity 1) and by the angle of collision of the CW and CCW agents (continuity 2) constitutes the most important reward. The proximity is important as well.
However, if the boundary gap is large but the two pieces of the boundary fit into a smooth curve, the configuration is rewarded accordingly. The model is implemented in Python 3.8 under Linux on a standard Intel Core i7-10700 processor with 16 GB RAM and an NVIDIA GTX 1080 graphics card.
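The linear scaling claim above can be illustrated with a short polynomial-fit check. The timing values below are hypothetical stand-ins for the measurements in Figure 8, not the paper's actual data.

```python
import numpy as np

# Hypothetical timing data (illustrative, not the measured values):
# tumor boundary length in pixels vs. average runtime in seconds.
L_b = np.array([500.0, 1000.0, 2000.0, 3000.0, 4000.0, 5000.0])
t = np.array([5.0, 10.1, 19.8, 30.2, 39.9, 50.3])

# Fit polynomials of increasing degree; the residual barely improves
# beyond degree 1, so the runtime is well modeled as linear: t = O(L_b).
for deg in (1, 2, 3):
    coeffs = np.polyfit(L_b, t, deg)
    rms = np.sqrt(np.mean((np.polyval(coeffs, L_b) - t) ** 2))
    print(f"degree {deg}: rms residual = {rms:.3f} s")

slope, intercept = np.polyfit(L_b, t, 1)
print(f"t ~ {slope:.4f} * L_b + {intercept:.2f}")
```

The same diagnostic (negligible coefficients at the quadratic and cubic terms) is what justifies the t = O(L_b) conclusion drawn later in the text.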

Segmentation Ratio and Standard Deviation
The heatmap in Figure 9 shows the segmentation ratio of MADRL compared with the best eight competing methods, namely, S-U-NET, SS-GAN, SR-LS, PK-SP, SC-SP, CB-LS, ML-AC, and DR-LS. All models score 100% at the simplest levels B|B|B and B|B|T. At the level B|T|B, every model except ML-AC and DR-LS maintains 100%. However, as the complexity increases from B|T|T to T|T|T, the models display a significant decline. MADRL stands out as the most robust model, upholding 100% until T|T|B (level 7) and achieving 97.6% even at the most challenging level T|T|T. While the DL models S-U-NET and SS-GAN come close to MADRL, none of the models match its consistently high performance. In contrast, DR-LS shows the fastest decline, with scores dropping to 77.78 at level B|T|T and plummeting to 36.36 at T|T|T. Consequently, while the performance of all other models tends to decline with increasing complexity, MADRL remains consistent across the eight levels. Figure 10 presents the Dice index and its standard deviation (SD) for the best eight segmentation models at the eight complexity levels. MADRL demonstrates stable performance, as indicated by the lowest SDs. The p-values of the difference between the second-best method and MADRL show an excellent statistical significance (p < 0.001) for Dice and H3 in the last four categories.
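For reference, the Dice index used throughout these comparisons is 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. The tiny binary masks below are purely illustrative.

```python
import numpy as np

# Dice similarity between a binary segmentation and the ground truth:
# Dice = 2 * |A intersect B| / (|A| + |B|).
def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
print(dice(pred, gt))  # 2 overlapping pixels, 3 + 3 total -> 4/6 ~ 0.667
```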

Conclusions and Future Work
A new multi-agent deep reinforcement learning (MADRL) algorithm tailored for breast US segmentation has been proposed and analyzed. The combination of the DL with the tracing RL agents demonstrates significant improvements over the existing state of the art. The method shows an excellent performance across a variety of complex US images, effectively overcoming challenges such as tumor heterogeneity, boundary leakage, and the global shape context problem. Notably, unlike the CNNs, which learn to extract patterns and features from the data, MADRL understands the structure of the image through the dynamics of the agents.
The proposed model is motivated by the autonomous vehicles of Braitenberg [5], designed to simulate neural mechanisms in biological brains. The ideas of Braitenberg have been successfully adapted to the segmentation of US images. A new reward system based on the Gestalt Laws of shape perception has been developed and implemented.
The system includes two continuity measures, a proximity measure, a density measure, and a closure measure. The numerical training shows that 50% of the rewards can be attributed to the continuity and 20% to the density measure. The final decision regarding the locomotion of the agents is generated by the DNN, which observes the entire population and knows their goal. This global knowledge of the image has been achieved using the dynamics of the agents and their ability to exchange information. Another source of information is the classification of the training images into eight categories. Therefore, the system is fed with samples of increasing complexity. This allows the DNN to adapt to the image features using an iterative approach, i.e., to start from the simplest category and finalize the training at the hardest level.
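The iterative, easy-to-hard schedule described above can be sketched as follows. The ordering of the eight categories follows Tables 2-9, and `curriculum_schedule` is a hypothetical helper, not the paper's implementation.

```python
# The eight complexity categories ordered from simplest to hardest,
# following Tables 2-9.
CATEGORIES = ["B|B|B", "B|B|T", "B|T|B", "B|T|T",
              "T|B|B", "T|B|T", "T|T|B", "T|T|T"]

def curriculum_schedule(epochs_per_level=3):
    """Yield (category, epoch) pairs, simplest category first."""
    for category in CATEGORIES:
        for epoch in range(epochs_per_level):
            yield category, epoch

schedule = list(curriculum_schedule())
# 8 categories x 3 epochs = 24 passes; training starts at B|B|B
# and finalizes at the hardest level, T|T|T.
```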
The numerical experiments test the proposed MADRL against 13 benchmark methods. The analysis of the results shows that the majority of the competing algorithms perform satisfactorily at the first 3-4 levels of complexity. The complexity of the tumor shape has the most profound impact on the accuracy of the results. When the shape of the tumor changes from simple to complex, 12 of the 13 benchmark methods fail for some images. The success rate ranges from 45% for conventional edge linking to 96% for a variant of the generative adversarial network [36]. Only S-U-NET [37] retained the original 100% success rate with a Dice coefficient of 88%. The success rate of S-U-NET drops below 100% only when the shape of the tumor is complex and the maximum boundary gap is large, and when all three criteria indicate high complexity. In these cases, the SG ratio of S-U-NET is 93% and 91%, respectively. However, MADRL outperforms its main competitor, keeping an SG ratio of 100% and Dice ≥ 88%.
The reward function is a variant of a mathematical representation of the Gestalt Laws of shape perception. The numerical experiments with the reward function show that the continuity of the trajectory constitutes the most important reward (over 50%). The density (a variant of proximity) is the second most important parameter (around 19%).
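A minimal sketch of such a weighted reward follows, assuming each Gestalt measure is normalized to [0, 1]. The continuity share (over 50%) and the density share (about 19%) loosely follow the trained values quoted above; the remaining weights are purely illustrative, not the values of Table 10.

```python
# Hypothetical weights for the five Gestalt measures.
WEIGHTS = {
    "continuity1": 0.30,  # average curvature of the trajectory
    "continuity2": 0.21,  # collision angle of the CW and CCW agents
    "density": 0.19,
    "proximity": 0.18,
    "closure": 0.12,
}

def reward(measures: dict) -> float:
    """Combine normalized measures (each in [0, 1]) into a scalar reward."""
    return sum(WEIGHTS[k] * measures[k] for k in WEIGHTS)

r = reward({"continuity1": 0.9, "continuity2": 0.8, "density": 0.7,
            "proximity": 0.6, "closure": 0.5})  # ~ 0.739
```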
The model shows the lowest standard deviation and an excellent statistical significance in terms of the p-values.
The computational time depends on the size of the tumor. Our preliminary experiments show that the computational time is a linear function of the length of the tumor boundary. In other words, the model shows excellent computational performance. The average time required for a 500 × 500 image ranges from 5 to 20 s. The model processes a 1000 × 1000 image in about 30 s.
Note the limitations of this study. While our method has shown significant progress in segmenting US tumors, the results may vary depending on the quality of the images and the complexity of the tumor, in particular when complex images come from an unseen database. The model has many training parameters. Most likely, they require substantial re-training and transfer learning if the model is applied to different types of medical imagery, such as MRI, CT scans, etc. However, the design of a universal segmentation model is outside the scope of this study. One of the specific features of the US images is a sufficiently smooth boundary of the tumor (although globally the shape can be very irregular). This helps the agents accurately approximate the boundary. The model requires modifications if applied to objects with many sharp turns; in this case, the agents "wiggle" at the corners, which calls for specific training.
Our future research focuses on optimizing this approach for real-time segmentation and exploring its potential for the classification of breast tumors. Another goal is to apply it to US imagery of different human organs; the most probable candidates are US images of thyroids. From the viewpoint of computer science, one of the most interesting questions is the relationship between the number of agents, the length of the boundary, and the complexity of the tumor (eight categories). This open problem requires further research and massive numerical experiments. From the viewpoint of clinical application, the speed remains one of the most important factors. Doctors do not want to wait 20 s; the result must appear on the screen almost immediately, i.e., within 2-3 s. However, the simplicity of the model mentioned above makes it possible to conjecture that this goal is achievable.

Figure 3. Key elements of the system.

Figure 4. Backbone network of the model: red dots indicate the agents, green the AL region.

Figure 6. Raw US images and the corresponding edge maps. (a) Simple cyst B|B|B, (b) medium complexity B|T|T, (c) high complexity T|T|T.

Figure 7. Segmentation of the tumors from different categories: white for ground truth, green for segmentation results.
The coefficients at L_b^2 and L_b^3 are less than 10^-7; therefore, t = O(L_b). The relationship between the number of agents, the length of the boundary, and the complexity of the tumor (eight categories) is an open problem that requires further research and massive numerical experiments.

Figure 8. Computational time as a function of the length of the tumor for a variable number of the agents.

Figure 9. Heatmap of the segmentation ratio for different complexities (the best eight models).

Figure 10. Dice index and standard deviations (whiskers) for different complexities.

Table 1. Hyperparameters of the training algorithm.

Table 2. Multi-agent deep reinforcement learning vs. state-of-the-art segmentation, complexity level B|B|B.

Table 3. Multi-agent deep reinforcement learning vs. state-of-the-art segmentation, complexity level B|B|T.

Table 4. Multi-agent deep reinforcement learning vs. state-of-the-art segmentation, complexity level B|T|B.

Table 5. Multi-agent deep reinforcement learning vs. state-of-the-art segmentation, complexity level B|T|T.

Table 6. Multi-agent deep reinforcement learning vs. state-of-the-art segmentation, complexity level T|B|B.

Table 7. Multi-agent deep reinforcement learning vs. state-of-the-art segmentation, complexity level T|B|T.

Table 8. Multi-agent deep reinforcement learning vs. state-of-the-art segmentation, complexity level T|T|B.

Table 9. Reinforcement learning vs. state-of-the-art segmentation, complexity level T|T|T.