CommuSpotter: Scene Text Spotting with Multi-Task Communication

Abstract: Scene text spotting is a challenging multi-task modulation for locating and recognizing texts in complex scenes. Existing end-to-end text spotters generally adopt sequentially decoupled multi-tasks, consisting of text detection and text recognition modules. Although customized modules are designed to connect the tasks closely, there is no interaction among multiple tasks, resulting in compatible information loss for the overall text spotting. Moreover, the independent and sequential modulation is unidirectional, accumulating errors from early to later tasks. In this paper, we propose CommuSpotter, which enhances multi-task communication by explicitly and concurrently sharing compatible information in overall scene text spotting. To address task-specific inconsistencies, we propose a Conversation Mechanism (CM) to extract and exchange expertise in each specific task with others. Specifically, the detection task is rectified by the text recognition task to filter out duplicated results and false positives, while the text recognition task is corrected by the rectified text detection task to replenish missing characters and decrease non-text interruptions. Consequently, the communication compensates for interaction information and breaks the sequential pipeline of error propagation. In addition, we adopt text semantic segmentation in the text recognition task, which reduces the complex design of customized modules and corresponding extra annotations. Compared with state-of-the-art methods, experimental results show that our method achieves competitive results with computation efficiency.


Introduction
Scene text spotting aims to detect and recognize various texts in complex scenes simultaneously. Derived from general object detection, scene text detection faces many challenges, such as scattered and changing characters across different words, text style and font variations, and background interruptions. Scene text recognition, in turn, incorporates language processing techniques to decode sequence features for text prediction. The end-to-end training and application of scene text detection and recognition have garnered increased research interest in recent years. End-to-end spotting not only integrates the two main tasks of text detection and recognition, but also bridges the training schemas of the two tasks for optimization. It has wide applications in artificial intelligence and computer vision, such as criminal investigation [1], video information retrieval [2], robotic assistance [3], and autonomous driving [4]. Modern scene text spotting still faces challenges such as artificially embellished texts, various text perspectives, and complex text shapes.
Existing two-stage paradigm text spotters conduct text detection tasks for text localization and text recognition tasks for words, while one-stage paradigm text spotters remove text detection but introduce customized modules for connecting the backbone with text recognition modules. The configuration examples of typical text spotters are collected in Table 1. Transformer-based backbones and text detectors greatly simplify proposal generation and boost performance at the expense of the computation cost. The customized modules are specifically designed to shrink the text instances or refine the features.
Although current end-to-end scene text spotters [5][6][7][8][9][10][11][12] have achieved substantial progress, there are three limitations. Firstly, the information or expertise from the multiple tasks compensates for each other in the final text spotting, but some expertise is lost in the current independent multi-task methods. For example, many studies improve information usage by sharing the network backbone [9,12,13], integrating the recognition loss [14,15], or bridging customized modules [5][6][7][8], as shown in Figure 1a. However, the interaction between tasks is not adequately identified. As indicated by the yellow arrow, there is only compensation from the text detection task (or customized modules) to the text recognition task, but not inversely: the expertise from customized modules or text recognition is scarcely utilized to interact with early text detection tasks or proposal generation. There is no communication or usage of each other's information among the tasks in the text-spotting process. Recently, some text spotters [14,15] have adopted a Transformer to understand the relationship between text instances but not between tasks. As a result, independent multi-task modulation leads to the loss of concurrent compensatory expertise from multiple tasks in text spotting. Secondly, apart from task-specific information loss, pipeline errors accumulate in current unidirectional text spotters. Errors from early tasks are not identified and, thus, accumulate in later tasks, leading to back-and-forth training. During inference, the methods become unreliable from the start because there is no ground truth at each step. Thirdly, the sophisticated customized modules increase model complexity and require expensive extra annotations. For example, some customized modules involve designs and labels for text lines [16], text strokes [17], text center points [18], and so on.
In this work, we propose CommuSpotter, which explicitly and concurrently communicates multi-task expertise among tasks. Multi-task communication involves information from all tasks complementing and assisting each individual task. Drawing inspiration from the image retrieval task [11], we develop a Conversation Mechanism (CM) to embed expertise vectors for multi-task interaction. This communication of expertise facilitates information exchange among text detection and recognition modules, as shown in Figure 1b, breaking the unidirectional pipeline and reducing accumulated errors from the beginning. Additionally, we introduce text semantic segmentation for text recognition without requiring further annotations. This is achieved by adopting the independently pre-trained module weights as priors, replacing complex customized modules and their extra annotations. Finally, CommuSpotter constructs a concise framework and achieves fast convergence.
The contributions of this paper are threefold: (1) Instead of independently and sequentially performing the text detection (or customized modules) and text recognition tasks, we develop CommuSpotter, which facilitates explicit communication through the designed Conversation Mechanism (CM). This mechanism embeds task-specific expertise concurrently across all tasks, compensating for task-specific information or expertise and reducing error propagation throughout the entire text-spotting process. (2) We employ text semantic segmentation expertise for text recognition tasks, reducing the need for complex custom module designs and their associated costly annotations. (3) We conduct comprehensive experiments on multiple text datasets. The comparisons with existing approaches demonstrate the advantages of our proposed method.

Scene Text Spotter of Two-Stage Paradigm
To address the challenges of arbitrary texts, Lyu et al. [13] developed Mask TextSpotter v1, which includes long short-term memory (LSTM) [20] in text recognition tasks to boost spotting results. In Mask TextSpotter v2 [19], text and character instance segmentations are adopted to improve recognition performance. Qiao et al. [21] predicted additional latent information from text detection tasks to enhance text instance segmentation. Qin et al. [5] developed Regions of Interest (RoIs) masking to improve segmentation accuracy by selecting and fusing features for instances. These studies focused on improving the text recognition task. Some of the following focus on improving the early text detection task. FOTS [9] and TextNet [22] adopt rotating RoIs and perspective RoIs to handle irregular and multioriented text detections, respectively. CRAFT [23] groups character region features from the text detector to reinforce character attention for the recognizer. Kittenplon et al. [15] adopted Transformer to improve the representation from the shared backbone, while text detection and recognition tasks were parallel and independent. SwinTextSpotter [14] adopts the Swin Transformer [24] in the backbone for better text representations. The expertise from the text detection task is transferred to the text recognition task, but there is no inverse interaction. In addition, errors accumulate from early tasks to later tasks. Furthermore, some dataset annotations are tailored to specific methods, such as character and polygon annotations [13,25], text-line annotations [16], etc.

Scene Text Spotter of One-Stage Paradigm
Many customized modules have been developed to replace traditional text detection with bounding boxes, thereby building a closer connection between multiple tasks. For some segmentation-based methods, Mask TextSpotter v3 [6] uses segmentation proposals for arbitrarily shaped text recognition. Liu et al. [8,26] designed Bezier curves to represent text instances. MANGO [7] extracts text center lines and text and character segmentation maps for text grouping and recognition. For some regression-based methods, Wang et al. [27] designed the instance boundary points to improve text instance shapes. TextDragon [28] combines sliding RoIs with local points. SRSTS [29] adopts anchor points in text recognition. These methods replace bounding boxes with accurate text boundaries for text recognition, but, as in two-stage paradigms, the customized modules are rarely rectified by later expertise from text recognition. Again, the errors are propagated. Customized modules also require extra annotations, such as character annotations [6], stroke annotations [17], and so on.

Scene Text Spotter with Back-Propagation
Some studies show backward compensation among tasks but only build conversion by recognition loss. Zhong et al. [30] adopted a spatial transform network (STN) to propagate the recognition loss back to the text detection task. SwinTextSpotter [14] proposes the synergy mechanism for joint optimization by backpropagating the recognition loss. However, the loss function does not facilitate concurrent expertise interaction among sequential tasks, and it is not effective during testing in encoding representations.

Text Spotter with Communication
Scene text spotting mainly consists of text detection and recognition tasks, as shown in Figure 1. For the text detection task, the image is fed into a CNN-based network to extract feature maps, which are then pooled by MaxPooling and Softmax layers for the coordinate prediction of text locations and classification of text detection. The metrics include accuracy, recall, and F-score of text classification, as well as location overlap with the ground truths. Due to challenges in scene text variations, detection often contains many false positives, leading to incorrect recognition based on those detections. Text recognition relies on RNN networks that sequentially process the feature maps to predict each word character in text instances. For a maximum length of 25 for each word, each character is predicted and grouped to form the final recognition results. The metrics include recognition accuracy, recall, and F-score for every word.
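As a minimal sketch of how the recall and F-score metrics mentioned above are typically computed from matched detections, the following function assumes the standard definitions of precision, recall, and their harmonic mean. The function name and the count-based interface are illustrative assumptions and are not taken from the paper or from any benchmark toolkit.

```python
def detection_scores(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f_score) from raw match counts.

    Assumption: a detection counts as a true positive when it has been
    matched to a ground-truth text instance by some overlap criterion
    decided elsewhere; the matching rule is outside this sketch.
    """
    detected = true_positives + false_positives    # all predicted instances
    annotated = true_positives + false_negatives   # all ground-truth instances
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / annotated if annotated else 0.0
    denom = precision + recall
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score
```

Because false positives lower precision while missed instances lower recall, the F-score drops when either kind of error grows, which is why it is the headline number in the comparisons later in the paper.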

Architecture
The whole framework of CommuSpotter is shown in Figure 2. It contains text detection and text recognition modules. Given an image, $I$, a backbone of ResNet-50 is used to extract feature maps, $P$, denoted as $\{P_2, P_3, P_4, P_5, P_6\}$, with a Feature Pyramid Network (FPN) following [31]. The Region Proposal Network (RPN) generates Region of Interest (RoI) proposals, $R$, in five scales of $\{32^2, 64^2, 128^2, 256^2, 512^2\}$ on the pyramid features. The redundant candidates are filtered out by Non-Maximum Suppression (NMS). The pyramid features, $P$, and proposals, $R$, are processed by the RoIAlign [32] layer to generate the RoI features, $F_{RoI}$ and $F'_{RoI}$, which are then fed into the text detection and text recognition modules. Moreover, the semantic expertise, $S$ and $S'$, is extracted from the feature maps via softmax operations and attention-based refinements.
In the text detection module, the RoI features, $F_{RoI}$, interact with the expertise from the text recognition module through the specially designed Conversation Mechanism (CM) (in Section 3.2) to generate meaningful text detection through the classification layer. In the text recognition module, we adopt a segmentation-based method for text instance segmentation as the word expertise, as described in Section 2.2. Additionally, we introduce text semantic segmentation expertise as character details for text recognition. The representation is communicated by our designed Conversation Mechanism (CM) (in Section 3.3). Finally, text sequences are recognized by the RNN mechanism [33]. Note that the contents of this Section 3.1 are existing pipelines of scene text spotting, while Sections 3.2 and 3.3 below are our proposed approaches.
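The NMS step that filters redundant RPN candidates in the pipeline above can be sketched as a greedy procedure over axis-aligned boxes. This is a plain-Python illustration of the general technique, not the implementation used in the paper; the function names and the IoU threshold of 0.5 are assumptions, since the text does not state a threshold.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    iou_threshold is a hypothetical value chosen for illustration.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

For example, with two heavily overlapping proposals and one distant proposal, only the higher-scoring member of the overlapping pair survives alongside the distant box.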

Text Detection Communication
Generally, the modality interaction [34] is conducted either by concatenating features and interacting together or by directly conducting interactions among features. Different tasks generate information from different perspectives for the final text spotting in text spotters. Existing text detection and recognition modules share the same backbone for different tasks. Considering the modeling efficiency, we keep the same backbone but introduce Extensive Representation (ER) on pyramid features for different tasks, as shown in Figure 3a. Prior to applying RoIAlign [32], which builds local coherence for feature alignments, we embed high-level pyramid features on a large receptive field for text detection. Specifically, the shared pyramid features $P_2$ to $P_5$ are updated as follows: where $P_2$ has shape $C \times H/4 \times W/4$, $P_3$ has shape $C \times H/8 \times W/8$, $P_4$ has shape $C \times H/16 \times W/16$, and $P_5$ has shape $C \times H/32 \times W/32$. $C$ represents the number of channels, set to 256, while $H$ and $W$ denote the input height and width, respectively. The operation Ur(·) is used for up-sampling. Gcm(·) denotes a sequence of convolution layers, consisting of $1 \times 3$ and $3 \times 1$ convolutions, which are applied to reduce dimensionality gradually. Finally, the results are combined as $P^*$, consisting of $P^*_2$ to $P^*_5$.
Compared with the general object detection of integrated targets, split characters inside texts often cause false positive detection results due to background interruptions and character-like non-texts. To achieve concurrent semantic guidance, we developed a Conversation Mechanism (CM) through a series of straightforward operations. This mechanism incorporates the expertise of the text recognition module, comprising (1) text semantic segmentation expertise $S$ (as detailed in Section 3.3) to identify character objects (rather than non-texts) and (2) text instance segmentation expertise $F$ to construct exact locations. Specifically, the refined features, $P^*$, and proposals, $R$, are fed into the RoIAlign with text semantic segmentation expertise $S$ to generate the aligned RoI feature $F_{RoI}$ of the shape $N \times C \times H_{RoI} \times W_{RoI}$, where $N$ is the number of mapped proposals, and $H_{RoI}$ and $W_{RoI}$ are feature map resolutions set to 7. To obtain text instance segmentation expertise $F$, we apply a sequence of four convolutional layers Conv and one transposed convolutional layer TrConv to the RoI features $F_{RoI}$. Then, the expertise is combined.
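To make the pyramid shapes quoted above concrete, the following sketch computes the $C \times H/s \times W/s$ shape of each level $P_2$ to $P_5$ for strides 4, 8, 16, and 32 with $C = 256$. The helper name and the use of floor division for input sizes not divisible by the stride are assumptions for illustration; the paper does not state how such sizes are rounded.

```python
def pyramid_shapes(height, width, channels=256, strides=(4, 8, 16, 32)):
    """Map level names P2..P5 to (C, H/stride, W/stride) shapes.

    Assumption: floor division is used when the input size is not an
    exact multiple of the stride.
    """
    return {
        f"P{level}": (channels, height // stride, width // stride)
        for level, stride in enumerate(strides, start=2)
    }


if __name__ == "__main__":
    for name, shape in pyramid_shapes(640, 1280).items():
        print(name, shape)
```

For a hypothetical 640 × 1280 input, $P_2$ is 256 × 160 × 320 while $P_5$ is 256 × 20 × 40, which illustrates why embedding high-level features gives the detection branch a much larger receptive field per location.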

Text Recognition Communication
As mentioned above, a segmentation-based text recognition module adopts text or character instance segmentation. Similar to the text detection module, the shared features are first extended by the reverse Extensive Representation (ER) to embed global information, as shown in Figure 3b. Specifically, the filtered pyramid features, $P$, consisting of multiple levels, from $P_2$ to $P_5$, are presented for the module as follows: where Dr(·) is the downsampling operation. Instance segmentation-based recognition methods often suffer from missing characters in texts, especially when some characters inside a text are artistically designed with different textures, shapes, fonts, etc. From this view, the purpose of the Conversation Mechanism (CM) is to integrate (1) the renewed expertise from the text detection module, $F$, as global guidance and (2) text semantic segmentation expertise $S$, which provides local details. Specifically, the extensive features, $P^{**}$, and filtered proposals, $R$, from RPN are fed into the RoIAlign with text semantic segmentation expertise $S$ to generate the aligned RoI feature, $F_{RoI}$, of the shape $N \times C \times H_{RoI} \times W_{RoI}$, where $N$ is the number of instances, and $H_{RoI}$ and $W_{RoI}$ are the resolutions of 16 and 64, respectively.
Then, the aligned RoI feature, $F_{RoI}$, is transformed by two consecutive linear convolutional layers Lr(·). Finally, we concatenate text detection expertise $F$ with the RoI features $F_{RoI}$.
Another problem of segmentation-based recognition is that character details are ignored in text spotters. The pixel-level expertise of characters can provide information to differentiate characters of similar shapes. For example, "hot" and "hat" only differ in one character's details. Due to the lack of pixel-level annotations for scene text datasets, text semantic segmentation is not applied in existing scene text spotters. Instead of designing complex modules to tighten the character regions with extra annotations, we introduce text semantic segmentation [35] expertise $S$ and expertise $S'$ with individually pre-trained weights. The initial semantic segmentation results, $S_e$, are refined by the attention mechanism [36] to obtain a better text semantic segmentation $S_{re}$, as shown in Figure 2.
$$S_e = \mathrm{Softmax}(S_f), \quad S_{att} = \mathrm{Att}(S_e, S_f), \quad S_{re} = \mathrm{Convolution}(S_e, S_{att}),$$
where $S_f$ is the fusion of pyramid features $P_2$ to $P_5$, Softmax is the softmax operation, Att is an attention layer of the dot-product operation, and Convolution is a series of convolution layers of kernel size $5 \times 5$ and $1 \times 1$, to fuse the attention with the feature maps. To communicate this expertise in the text spotter, we take pooling operations on text semantic segmentation $S_{re}$ with different resolutions. Then, text semantic segmentation expertise $S$ and expertise $S'$ are generated:
$$S = \mathcal{P}(S_{re}), \quad S' = \mathcal{P}'(S_{re}),$$
where $\mathcal{P}$ and $\mathcal{P}'$ are the pooling layers with different resolutions. The shape of expertise

Optimization
The whole framework loss $L$ is defined as follows:
$$L = L_{rpn} + \alpha L_{rcnn} + \beta L_{rec},$$
where $L_{rpn}$, $L_{rcnn}$, and $L_{rec}$ are the losses of RPN [13], text detection module, and text recognition module, respectively. The weights $\alpha$ and $\beta$ are both set to 1.0. The detection module loss $L_{rcnn}$ consists of the cross-entropy classification loss $L_{cls}$ and the smooth L1 regression loss $L_{reg}$ [37]. The text recognition module loss $L_{rec}$ includes a cross-entropy text instance segmentation loss $L_{ins}$, a character instance segmentation loss $L_{seg}$, and a sequence recognition loss $L_{seq}$ [13]. Its two weights, $\delta$ and a second coefficient, are empirically set to 1.0 and 0.2, respectively. $L_{seq}$ follows a summation of the logarithm loss [13]. The text instance map is encoded by convolutional and max pooling layers with two-dimensional representations [38], and fed into a seq2seq recognizer [39] to generate text sequences.
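As a rough illustration of how the loss terms above fit together, the sketch below assumes that the total loss is the weighted sum $L_{rpn} + \alpha L_{rcnn} + \beta L_{rec}$ with $L_{rcnn} = L_{cls} + L_{reg}$, which is how the description of the weights $\alpha$ and $\beta$ reads. The smooth L1 function follows the usual Fast R-CNN form; the function names, the mean reduction, and the `transition` parameter are assumptions for illustration, and the internal weighting of the recognition loss is deliberately left out because the source does not fully specify it.

```python
def smooth_l1(pred, target, transition=1.0):
    """Mean smooth L1 loss over paired values.

    Quadratic for small errors (|d| < transition), linear otherwise.
    The mean reduction is an assumption made for this sketch.
    """
    if not pred:
        return 0.0
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        if d < transition:
            total += 0.5 * d * d / transition
        else:
            total += d - 0.5 * transition
    return total / len(pred)


def total_loss(l_rpn, l_cls, l_reg, l_rec, alpha=1.0, beta=1.0):
    """Combine the loss terms; alpha and beta default to 1.0 as in the text."""
    l_rcnn = l_cls + l_reg  # detection loss: classification + box regression
    return l_rpn + alpha * l_rcnn + beta * l_rec
```

With both weights at 1.0, the combination reduces to a plain sum of the RPN, detection, and recognition terms.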

Experiments

Datasets
SynthText is a synthetic dataset [40] of around 800 K images with comprehensive text samples used for pre-training. ICDAR2013 (IC13) is from the 2013 Robust Reading Competition [41], consisting of 229 training images and 233 test images. ICDAR2015 (IC15) is provided by the 2015 Robust Reading Competition [42]. It contains incidental scene texts in 1000 training images and 500 test images. Total-Text (TT) [43] focuses on arbitrarily shaped texts, including 1255 training and 300 test images. SCUT [44] contains 1162 natural images from Flickr [13]. These real data are used for fine-tuning the model. The evaluation is conducted on IC15 and TT, current scene text-spotting benchmarks.

Implementation Details
The model is trained using PyTorch with two Tesla-V100 GPUs and tested on a single GPU. Following Mask TextSpotter v2 [19], the training process consists of two parts: pre-training and fine-tuning. The optimizer is Stochastic Gradient Descent (SGD) with a weight decay of 0.001 and a momentum of 0.9. In the pre-training stage, the model is trained on SynthText [40] for 270 K iterations. The initial learning rate is set to 0.01 and decayed to a tenth every 90 K iterations. In the fine-tuning stage, the model is trained on multiple real-world image datasets for 90 K iterations. The learning rate is set to 0.001. In the inference stage, the input images are fed into the model to generate proposals, instances, and recognition predictions.
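The pre-training schedule described above (initial learning rate 0.01, reduced to a tenth every 90 K iterations over 270 K iterations) can be written as a simple step function. This is a sketch of the schedule as stated, not the authors' training script; the function name and the convention that the decay takes effect exactly at each 90 K boundary are assumptions.

```python
def learning_rate(iteration, base_lr=0.01, step=90_000, gamma=0.1):
    """Step-decay learning rate: multiply by gamma once per completed step.

    Assumption: iterations are counted from 0 and the first decay applies
    at iteration 90,000.
    """
    return base_lr * gamma ** (iteration // step)


if __name__ == "__main__":
    for it in (0, 90_000, 180_000, 269_999):
        print(it, learning_rate(it))
```

Under this reading, the three 90 K segments of pre-training use learning rates of roughly 0.01, 0.001, and 0.0001.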

Ablation Study
The experiments report percentage values of the end-to-end recognition F-scores on the ICDAR15 dataset with a strong lexicon. We first train a model that adds only the Extensive Representation (ER) mechanism to the baseline Mask TextSpotter v2 [19]. The ER improves the recognition performance from 82.1% to 82.5% in Table 2. To validate the effectiveness of communication between text detection and instance segmentation tasks, we add the Conversation Mechanism (CM) to them but without expertise from text semantic segmentation. The recognition results indicate that concurrent communication can improve performance from 82.1% to 83.7%. If equipped with the above ER, it can achieve 84.6%. Finally, with full settings of CM, we add the interaction from the text semantic segmentation. The F-score is further improved from 84.6% to 85.8%. Thus, there is a collaborative effect for the entire text spotter.
We train and fine-tune several models with these configurations on the Total-Text dataset, as shown in Table 3. The ER improves the spotting performance from 77.4% for the baseline Mask TextSpotter v2 [19] to 78.4%. CM-P alone improves the performance from 77.4% to 80.3%. Together with ER, concurrent communication improves performance to 81.7%. With full settings of CM, we add the interaction from text semantic segmentation. The F-score is further improved to 83.4%.
As mentioned above, some methods adopting Transformer or customized modules may train the model on extra datasets or create corresponding extra labels for the datasets. This makes a fair comparison of different approaches difficult. We therefore compare the training configurations and computation costs with some approaches in Table 4. For the abbreviation of datasets, "CST" is Curved SynthText [8]; "COCO" is COCO-Text [45]; "MLT" is ICDAR-MLT [46]; "IC13" is ICDAR2013 [41]; and "CTW" is SCUT-CTW1500 [47]. For the maximum iterations of convergence in the training process, we only need 360 K iterations in total (270 K for pre-training and 90 K for fine-tuning), which is among the fewest of the compared models. The GPU hours are roughly estimated from previous papers or re-implementations. Our GPU hours are slightly higher than those of the most lightweight ABCNet [8], but our performance is better than that work. As a result, our method can achieve competitive performance at a lower cost compared to previous approaches. It is a good trade-off between model performance and computational efficiency.

Incidental Texts
All the following comparison metrics on the text-spotting datasets are percentage values. We evaluate the method on incidental texts of IC15 [42]. The evaluation results are shown in Table 5. Our method achieves a state-of-the-art recall of 88.3% and an F-score of 89.8% compared with previous studies. For the end-to-end recognition evaluation, our method outperforms the previous non-Transformer methods with the strong and generic lexicons. The F-scores reach 85.8% and 74.9%, respectively. For recent Transformer-based methods, it is hard to distinguish the effects of the Transformer from those of the modulation. Also, TextTranSpotter (TTS) [15] achieves a higher F-score for the generic lexicon when trained with 43 K more images compared with our 4 K images. Our results demonstrate the effectiveness of communication among multi-tasks for end-to-end text spotting in incidental texts. Apart from the Transformer, this represents a promising exploration into the simple and clean modulation of text spotters.
Table 5. Comparison results on the ICDAR2015 dataset. For the detection result, "P", "R", and "F" represent the metrics of precision, recall, and F-score, respectively. The end-to-end recognition evaluation is the F-score. "S", "W", and "G" mean strong, weak, and generic lexicons, respectively.

Arbitrary Texts
To verify the effectiveness of the model on irregular texts, we conduct experiments on Total-Text (TT) [43], following the detection and end-to-end evaluation protocols published with the dataset. The comparison results are presented in Table 6. Our detection results achieve state-of-the-art performance compared with non-Transformer methods and achieve the best recall and F-scores compared to recent Transformer methods. The recognition results outperform previous non-Transformer text spotters but are not as good as those of Transformer methods. However, the non-lexicon F-score of 73.0% is significantly improved from the 65.3% of the baseline Mask TextSpotter v2 [19]. The recognition result of 83.4% with lexicon outperforms the corresponding baseline of 77.4%. Compared with SwinTextSpotter [14] with the ResNet backbone (shown as SwinTextSpotter-R), our recognition results achieve better performance. This shows that, without a Transformer, our communication mechanism can be better in modulation. The newest SRSTS [29] achieves the best recognition results with Transformer decoders, which indicates that there is still room for improvement in our recognition module.

Inference Speed
The inference speed is compared with previous approaches, as shown in Table 7. Not all recent studies provide inference time statistics. Compared to MANGO [7], our method is a little slower but with better performance. As mentioned above, our method is more efficient at training and can be easily used in practical applications without complex modules and expensive data labeling.

Qualitative Results
Figure 4 shows the improvements of our approach (second row) over the baseline Mask TextSpotter v2 [19] (first row). For the first two examples with complex scenes, the original method often produces false positive detections and wrong recognition results. That is, decorated curves are wrongly detected as texts and recognized as non-existing information. For the last two examples of arbitrarily shaped texts, the text shapes and background interruptions often cause incomplete text instances. For example, the curved character layout is easily spotted as duplicated text fragments rather than whole instances. As a result, many recognized texts in the scene have unclear meanings. There are fewer errors in our method in the second row. We present more qualitative samples in the third and fourth rows of Figure 4 for comparison.

Conclusions
We proposed a streamlined framework to facilitate the exchange of expertise among multiple tasks in the scene text spotter. With the bidirectional communication between text detection and text recognition modules, our CommuSpotter allows for concurrent expertise exchange and early error correction. Instead of a complex, customized module design for tight character regions, we introduce text semantic segmentation in the recognition module. We conduct experiments on incidental and curved text datasets; the proposed method achieves consistently competitive performance with model efficiency.

Figure 1. An illustration of typical text spotters (a); frameworks compared with our proposed spotter with communication mechanisms in multi-tasks (b).

Figure 2. Architecture of the proposed CommuSpotter. From the backbone, the pyramid features, P, and the proposals, R, are processed by text detection and text recognition modules. "ER" denotes Extensive Representation, and Conversation Mechanisms are designed for different communications. From the backbone, text semantic segmentation expertise is generated for the text recognition module. Multiple forms of expertise interact with each other concurrently. The black arrows indicate the forward flow of information, while the colorful arrows represent the backward interaction of multitask expertise.

Figure 3. The communication networks of (a) the text detection module and (b) the text recognition module.

Figure 4. Comparison of qualitative samples on ICDAR2015 and Total-Text datasets. The bounding boxes or polygons and recognized texts in red and pink colors are the wrong results, while those in green are the correct ones.

Author Contributions: Conceptualization, L.Z., G.W. and S.W.; methodology, L.Z.; software, L.Z.; validation, L.Z.; writing-original draft preparation, L.Z.; writing-review and editing, L.Z., G.W. and S.W.; supervision, S.W.; funding acquisition, G.W. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the XSEDE Program of the National Science Foundation and the Aspire-II Research Program at the University of South Carolina.

Table 1. Comparison between typical models. There are mainly two categories of methods divided by Transformer architecture. "CNN" is the traditional ResNet backbone. "Trans" stands for Transformer architecture. "Seg" is short for "segmentation". "Att" is the attention-based loss, while "CTC" is the CTC loss. Specific module techniques and details can be found in each method.

Table 2. End-to-end recognition results of ICDAR15 on different model configurations. "CM-P" stands for part of the Conversation Mechanism without text semantic segmentation expertise.

Table 3. End-to-end recognition results of Total-Text on different model configurations. "CM-P" stands for part of the Conversation Mechanism without text semantic segmentation expertise.

Table 4. Comparison between training configuration and computation cost. In the pre-training, mix-training, and/or fine-tuning stages, the word "Data" represents the datasets used for training, and "Iter." denotes the number of convergence iterations needed. "GPU" denotes the estimated GPU hours for each method.

Table 6. Comparison results on the Total-Text (TT) dataset. The detection results are measured by precision (P), recall (R), and F-score (F). The recognition F-scores include evaluation without lexicon as "None" and with lexicon as "Full".

Table 7. Frames per second (FPS) comparison on different inference datasets.