Identification of the Knowledge Structure of Cancer Survivors’ Return to Work and Quality of Life: A Text Network Analysis

This study aimed to understand the trends in research on return to work (RTW) and quality of life among cancer survivors using text network analysis. Titles and abstracts were examined to extract terms from 219 articles, published between 1990 and June 2020, that contained “cancer survivors”, “return to work”, and “quality of life”. Python and Gephi software were used to analyze the data and visualize the networks. Keyword ranking was based on frequency, degree centrality, and betweenness centrality. The keywords commonly ranked at the top included “breast”, “patients”, “rehabilitation”, “intervention”, “treatment”, and “employment”. Clustering, which groups nodes with high relevance in the network, led to four clusters: “participants and method”, “type of research and variables”, “RTW and education in adolescent and young adult cancer survivors”, and “rehabilitation program”. This study provided a visualized overview of the research on cancer survivors’ RTW and quality of life. These findings contribute to the understanding of the knowledge structure of the existing research and suggest directions for future research.


Introduction
The number of cancer survivors is rapidly increasing due to early screening and improved medical technology. As cancer survival rates increase, optimizing healthcare provision and long-term outcomes for survivors becomes increasingly imperative [1].
For cancer survivors, returning to work (RTW) signifies more than economic power. It represents a return to the daily life from which they were temporarily excluded because of cancer diagnosis and treatment, and it is a significant factor in their quality of life through the restoration of interpersonal relationships and social status [2-4]. Cancer survivors resume their daily life, including RTW, during or after treatment; however, the RTW rate varies from 30.5% to 71.5% [5,6]. Many cancer survivors have difficulty returning to work. Moreover, there are cases of unemployment or job turnover due to the side effects of cancer treatment and the social stigma associated with cancer patients [5]. The unemployment rate among cancer survivors differs by diagnosis; nonetheless, cancer survivors have a 1.4 times higher unemployment rate than healthy controls [7].
After their treatment, RTW cancer survivors experience side effects, such as fatigue, pain, restriction of physical function and activity, anxiety, and depression, which lead to difficulties in re-adaptation to work-life balance, job performance, and interpersonal relationships [3,8]. These complex factors increase the job stress of cancer survivors, hinder a successful return to work, and result in job turnover, leading to a decrease in their overall quality of life, which, in turn,

Search and Collection of Articles
Research literature was collected from PubMed, Embase, Cochrane, the Cumulative Index to Nursing and Allied Health Literature, Web of Science, and Scopus using EndNote (Clarivate Analytics, Boston, MA, USA). Articles containing "cancer survivors", "return to work", and "quality of life" in the title or abstract were selected and vocabulary combinations with similar meanings were also considered (e.g., "back to work" or "job re-entry"). After excluding duplicates, 219 articles published between 1990 and June 2020 were collected. For text mining, titles and abstracts were organized into one row and saved as an Excel file. Articles without an abstract were also excluded from the collection.

Keyword Extraction and Preprocessing
To extract keywords, punctuation and numbers in the titles and abstracts were removed. Afterward, the part of speech of each word was identified; only nouns, verbs, and adjectives were kept, and lemmatization was performed to convert them to their base form. Finally, by removing stopwords and the keywords directly used in the literature search, such as "cancer", "survive", and "work", the titles and abstracts were reconstructed with only meaningful words. Furthermore, for a rough analysis, the frequencies of words and n-grams (combinations of n consecutive words) were measured [26]. For these preprocessing steps, we developed a tool that uses the Natural Language Toolkit library (https://www.nltk.org/; Free and open-source software) in the Jupyter notebook (https://jupyter.org/; Free and open-source software) environment, based on the Python programming language.
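The preprocessing steps described above can be illustrated with a minimal sketch. It is not the tool developed for this study: to stay self-contained it uses only the Python standard library, so the stopword list and the lemma table are small hypothetical stand-ins for what NLTK would provide, the part-of-speech filter is omitted, and the two example titles are invented for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a full list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for", "on", "with"}
# Search terms dropped because every retrieved article contains them.
SEARCH_TERMS = {"cancer", "survive", "work"}
# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"survivors": "survive", "survivor": "survive", "patients": "patient"}


def preprocess(text):
    """Lowercase, strip punctuation and digits, lemmatize, and drop stopwords."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [LEMMAS.get(tok, tok) for tok in text.split()]
    return [t for t in tokens if t not in STOPWORDS and t not in SEARCH_TERMS]


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented example titles, not taken from the 219 collected articles.
titles = [
    "Return to work of breast cancer survivors: a 3-year follow-up",
    "Rehabilitation intervention for breast cancer patients",
]
docs = [preprocess(t) for t in titles]
word_freq = Counter(tok for doc in docs for tok in doc)
bigram_freq = Counter(bg for doc in docs for bg in ngrams(doc, 2))
```

Here `word_freq` and `bigram_freq` hold the word and 2-gram counts used for the rough frequency analysis; raising `n` in `ngrams` gives longer word combinations.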

Semantic Network Analysis for Title
The semantic network is a graph that expresses words and their relationships as nodes and edges, respectively [27]. To establish word associations, this study defined word pairs appearing together in a paper's title as related to each other. Each node (word) is connected to other nodes in the network, and the number of connections is called the "degree" of that node. The degree of a node indicates its importance, and degree centrality quantifies how central each node is as a value between 0 and 1. Moreover, betweenness centrality is a value between 0 and 1 that indicates how much each node acts as a bridge for the connections between other nodes.
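As a hedged illustration of these definitions, the sketch below builds a title co-occurrence network and computes both centrality measures in plain Python. It is not the NetworkX code used in the study. It assumes the usual normalizations (degree divided by n - 1; betweenness computed with Brandes' algorithm and divided by (n - 1)(n - 2) for an undirected graph), and the three tokenized titles are invented.

```python
from collections import deque
from itertools import combinations


def build_cooccurrence(tokenized_titles):
    """Undirected graph {word: set(neighbours)}; words sharing a title are linked."""
    adj = {}
    for words in tokenized_titles:
        unique = sorted(set(words))
        for w in unique:
            adj.setdefault(w, set())
        for u, v in combinations(unique, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree_centrality(adj):
    """Degree divided by the maximum possible degree, n - 1."""
    n = len(adj)
    return {v: len(nb) / (n - 1) for v, nb in adj.items()}


def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph, scaled to [0, 1]."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # Every unordered pair was counted from both endpoints, hence (n-1)(n-2).
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in bc.items()}


# Invented tokenized titles forming a chain of four words.
toy_titles = [["breast", "treatment"], ["breast", "rehabilitation"],
              ["rehabilitation", "program"]]
graph = build_cooccurrence(toy_titles)
dc = degree_centrality(graph)
bc = betweenness_centrality(graph)
```

In this toy chain, the two inner words bridge all paths between the outer ones, so they receive the highest betweenness while the end words receive zero.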
Besides these numerical measurements, network analysis was performed using a clustering technique for grouping nodes with high relevance on a network. We applied Blondel's model [28], based on modularity, which organizes a network into groups with dense connections within each group (high intraconnection) and sparse connections between groups (low interconnection). This technique makes it possible to identify which words are highly related to each other within groups. Clustering and visualization were performed using a sub-network composed of selected main keywords only, because the original semantic network was too large to present in its entirety. A Python library, NetworkX (https://networkx.org/; Free and open-source software), was used for network creation and analysis, and the Gephi tool (https://gephi.org/; Free and open-source software) was used for clustering and visualization.
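To make the notion of modularity concrete, the following sketch computes the modularity score Q of a given partition, using the standard definition Q = sum over communities of [L_c / m - (d_c / 2m)^2], where m is the number of edges, L_c the number of edges inside community c, and d_c the total degree of its nodes. It only evaluates the objective that modularity-based clustering tries to maximize; it does not perform the optimization that Gephi carries out, and the toy graph is an assumption for illustration.

```python
def modularity(adj, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    adj: {node: set(neighbours)}; community_of: {node: community label}.
    """
    m = sum(len(nb) for nb in adj.values()) / 2  # number of edges
    internal = {}    # edges with both endpoints in the same community
    degree_sum = {}  # total degree of the nodes in each community
    for u, nbrs in adj.items():
        c = community_of[u]
        degree_sum[c] = degree_sum.get(c, 0) + len(nbrs)
        for v in nbrs:
            if community_of[v] == c:
                # Each internal edge is visited once from each endpoint.
                internal[c] = internal.get(c, 0) + 0.5
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = dict.fromkeys(toy, 0)
q_split = modularity(toy, two_groups)
q_single = modularity(toy, one_group)
```

Splitting the toy graph into its two triangles yields a clearly positive Q, whereas placing every node in one group yields Q = 0, which is why a modularity-based method prefers the split.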

Hierarchical Topic Analysis for Abstract
Topic analysis is a technique based on unsupervised machine learning that infers which topics are embedded in a large number of text documents. A commonly used topic analysis technique is Latent Dirichlet Allocation (LDA) [29], which represents all topics in a flat relationship. This study applied hierarchical LDA (hLDA) [25], which re-organizes topics into a vertical hierarchical structure to make large document collections easier to interpret. The hLDA combines the nested Chinese Restaurant Process [30] with LDA and is explained by the following analogy:

• There are Chinese restaurant chains organized into a tree structure.
• A guest eats in one restaurant and then moves to the next restaurant in the subchain.
• There are many tables and seats in each restaurant and guests choose seats based on the popularity of the table.

Following these assumptions, each time a new guest comes, the type of food and the popularity of the table converge on a certain value. In the hLDA, the food served by the Chinese restaurants at each level represents a topic and the guests represent the documents. As guests visit the Chinese restaurants at different levels, the n-layer foods they eat correspond to the n-layer topics of the document. This allows us to identify hierarchical topic structures and the topics to which each document belongs. The hLDA requires user input on how many layers the topic hierarchy will have; in this study, several models were created with three and four layers. Among the generated topic models, one analytically representative model was selected and a detailed analysis was performed. For topic extraction, the Gensim (https://radimrehurek.com/gensim/; Free and open-source software) and hLDA (https://github.com/joewandy/hlda/; Free and open-source software) libraries were used.

Core Keywords that Emerged from the Research Titles
Table 1 presents the top 20 core keywords by frequency, edge, degree centrality, and betweenness centrality indices calculated from the main words extracted from the studies on the quality of life among RTW cancer survivors. There were 587 unique words extracted from the selected 219 research titles. Therefore, the title semantic network had 587 nodes.
Regarding frequency, "breast", "patients", "rehabilitation", "intervention", and "treatment" were ranked at the top. The edges and degree centrality were also high, in the order of "breast", "patients", "intervention", "rehabilitation", "employment", and "treatment". The frequency, edge, and degree centrality showed similar rankings for most of the core keywords, with only minor differences. However, after the seventh place, the keywords "development", "follow", "experience", and "survivorship" were ranked relatively high in betweenness centrality compared to frequency and edge.

Semantic Network Analysis
There were 4346 word pairs, resulting in a semantic network with 587 nodes and 4346 edges. The average degree value of the network was 17.8 and we selected important nodes with a degree of 30 or higher, which represented 10% of all nodes. Figure 2 shows the semantic network diagram for the top 10% based on degree centrality. The network was divided into four clusters, and nodes were distinguished by font size and color according to connectivity. The keyword "breast" was clustered with keywords such as "treatment", "survivorship", "diagnosis", "employment", "association", "impact", "symptom", "change", and "prospective" and was labeled as "participants and method". The keyword "patient" was clustered with keywords such as "intervention", "psychosocial", "control", "development", "systematic", "protocol", "trial", "support", and "randomize" and was labeled as "type of research and variables". Additionally, the keywords "adult", "young", "stem", "cell", "transplantation", "education", "early", and "experience" formed another cluster labeled as "RTW and education in adolescent and young adult cancer survivors". The keywords "rehabilitation", "occupational", "program", and "pilot" formed another cluster labeled as "rehabilitation program".
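The selection of a high-degree sub-network, as described at the start of this section, can be sketched as follows. This is an illustrative reconstruction, not the study's code: the function names and the toy graph are assumptions, and the degree is taken from the full network before filtering.

```python
def average_degree(adj):
    """Mean number of neighbours per node in {node: set(neighbours)}."""
    return sum(len(nb) for nb in adj.values()) / len(adj)


def high_degree_subnetwork(adj, min_degree):
    """Keep nodes whose degree in the full network is at least min_degree,
    together with the edges that run between the kept nodes."""
    keep = {v for v, nb in adj.items() if len(nb) >= min_degree}
    return {v: adj[v] & keep for v in keep}


# Toy network: a hub linked to four words, two of which are also linked.
toy = {
    "hub": {"w1", "w2", "w3", "w4"},
    "w1": {"hub", "w2"},
    "w2": {"hub", "w1"},
    "w3": {"hub"},
    "w4": {"hub"},
}
mean_deg = average_degree(toy)
core = high_degree_subnetwork(toy, min_degree=2)
```

With the study's threshold, the call would use `min_degree=30` on the full title network; the resulting sub-network is what would then be passed on for clustering and visualization.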

Hierarchical Topic Analysis for Abstracts
Words dealing with the same subject were assumed to appear together frequently, so a topic could be inferred by identifying a set of these words. In total, 29,307 words appeared in the collected abstracts of the 219 papers, of which 3901 were unique. Among the generated hLDA topic models, the researchers selected one analytically representative model with three layers (Table 2).
From the hierarchical topic analysis on the three layers, two topics were derived from level 1 and 11 topics were derived from level 2. The 11 identified topics were classified as follows: rehabilitation intervention (intervention, rehabilitation, trial, exercise, program), employment and symptom (employment, diagnosis, status, symptom, confidence interval), diagnosis and job status (month, diagnosis, employment, status, job), psychosocial factors (pain, fear of cancer recurrence, lymphedema, barrier, surgeon), health-related quality of life by group (health-related, confidence interval, health-related quality of life, oral, group), physical exercise (exercise, lung, physical, patient, improve), education of young patient (adult, adolescent and young adults, educational, service, young), job type and quality of working (self-employed, item, the quality of working life questionnaire for cancer survivors, module, job), analytic method (literature, search, criterion, systematic, productivity), cost (engagement, consequence, cost, stakeholder, provide), and intervention for hematopoietic stem cell transplant (hematopoietic stem cell transplant, yoga, cognitive, transplantation, standard care).

Discussion
This study analyzed the literature on the quality of life of RTW cancer survivors to understand the flow of knowledge structure of the existing research and suggest directions for future research. The study found that the core keywords of the quality of life of RTW cancer survivors research included "breast", "patients", "rehabilitation", "intervention", "treatment", and "employment". The keywords with high centrality were regarded as core keywords [31]; most of these high centrality keywords were also high in frequency.
"Breast" and "patient" were ranked high in frequency, degree centrality, and betweenness centrality. Breast cancer is a carcinoma with high prevalence in women. Since the development of diagnostic and treatment methods has led to a high survival rate and long duration of survival, breast cancer is being studied in many areas. This is consistent with a previous study [22], which examined the knowledge structure of cancer survivors and showed that breast neoplasm ranked high in frequency, after quality of life.
Individual perception of discrimination and lack of support from employers and colleagues can negatively affect successful workplace participation, and these problems are more serious for female cancer survivors [32,33]. Women's resumption of work is negatively affected by various human resource factors [34]. Therefore, it is necessary to examine more diverse strategies and conduct intervention development studies for RTW in breast cancer patients.
RTW has been treated as a concept related to social, vocational, and physical rehabilitation for the workplace adaptation of cancer survivors, indicating that many studies are related to intervention performance. Successful vocational rehabilitation has a major impact on RTW skills and helps maintain the quality of working life of cancer survivors [35]. Additionally, the work capacity of cancer survivors is affected by neuropathy, fatigue, and chemotherapy or radiation therapy [10,36]. Therefore, various studies have been conducted to improve their quality of life according to the ongoing treatment.
In this study, the degree centrality of the title keywords was visualized as a sociogram. In the sociogram, the size of the node represented the grade of degree centrality and the thickness of the edge represented the connection strength, that is, the frequency of simultaneous occurrence.
Keywords with high degree centrality are connected to many other keywords and are located at the center of the network and thus represent an important core topic. In this study, most of the high-ranking keywords in frequency, degree centrality, and betweenness centrality appeared prominently in the sociogram. However, keywords such as "development", "experience", and "survivorship", which were not ranked high for edge, frequency, and degree centrality, were relevant in a mediating role, that is, in betweenness centrality. This is important because keywords with high betweenness centrality act as mediators between other keywords and function as a bridge to expand from one topic to another [37,38], even though their frequency and edge values are not relatively high. Such keywords are therefore believed to play a significant role in the knowledge structure.
Clustering was performed by grouping nodes with high relevance in the network. Based on connectivity, the keywords were divided into four clusters. Research on the quality of life of RTW cancer survivors was divided into the following groups: "participants and method", "type of research and variables", "RTW and education in adolescent and young adult cancer survivors", and "rehabilitation program". This implies that research is being conducted on the successful RTW of cancer survivors by applying various research methods. Moreover, previous studies have primarily used randomized trials, including physical exercise and psychosocial support of chemotherapy subjects, tailored intervention development, and systematic reviews to evaluate the effectiveness of these interventions.
In particular, intervention studies based on occupational rehabilitation and experience and education issues of RTW, such as education in adolescent and young adult cancer survivors, were considered important and found to be highly connected by the current study.
Although various intervention studies have been conducted, the effect of psycho-educational interventions without vocational rehabilitation is unclear. Therefore, a vocational rehabilitation program that includes work coordination and supervisor-centered vocational components rather than a patient-centered occupational environment program should be provided [14].
This was also studied as an important keyword in the topic analysis of the abstracts. The contents on job status, type, and quality of life were presented as a topic and issues related to cost and stem cell transplantation in adolescents and young adults were grouped as another topic. Compared to salaried workers, self-employed cancer survivors may have more difficulty in performing their jobs after RTW because they have less social support at work and less legal support from related laws and public health insurance [13]. Therefore, further research on the RTW of cancer survivors according to type of work and employment is needed. Many stem cell transplant cancer survivors complain of impairment in daily life functioning due to cognitive impairment, which is associated with younger age and reduced health-related quality of life [39]. In particular, long survival rates and cognitive impairment in cancer survivors are related to education, which in turn is related to finding a quality job, which is an important issue that needs to be addressed.
RTW is regarded as a marker of complete recovery and restored normality [9,14]. It is often characterized as a complex and prolonged trajectory [40]. While several pieces of evidence indicate that quality work is beneficial for the physical and mental health of cancer survivors, unemployment and long-term absence due to illness have detrimental effects [14,41].
Furthermore, unsuccessful RTW has a significant impact on the health care system and on insurance, of which direct or indirect social costs are paid by patients and their families, employers, and society [42]. Therefore, the RTW of cancer patients is an issue that should be dealt with not only individually, but also socially. Based on the trends found in the present study, future research should consider various variables for RTW and quality of life improvement.
The strength of this study lies in the text network analysis that enabled us to identify the knowledge structure and topics related to RTW cancer survivors and quality of life research effectively, objectively, and comprehensively, thus providing the basis for the continuous improvement of RTW intervention research. However, this study also has certain limitations. Because the extracted text was collected only from the titles and abstracts of published papers, the ultimate purpose and meaning of each study may have been excluded. In addition, the time-dependent nature of the research may have had an impact that was not reflected in the present analysis.

Conclusions
Through network analysis and clustering, keywords were divided into four knowledge structures. Similar results were confirmed using hierarchical topic analysis on the abstracts. The study divided the research subjects into four clusters and found that studies have addressed a psychosocial approach, young cancer survivors, breast cancer patients, rehabilitation, and interventions, all of which appeared as keywords. This knowledge structure made it possible to identify the subject matter of the existing studies, which helped reveal the research trends and directions related to the successful RTW and work-related quality of life of cancer survivors.
This study showed the visualized trends of the knowledge structure and research direction based on the previously published literature. However, it was difficult to confirm the body of knowledge about direct demands, such as the barriers and facilitating factors associated with RTW experienced by cancer survivors. Therefore, it will be necessary to develop RTW interventions that reflect the needs of cancer survivors and the trends of the times through network analysis based on the direct responses of cancer survivors in online big data, such as online news comments and social network services.
In addition, through this study, trends and limitations in the diversity of related studies and the importance of expanding research on various intervention programs and accumulating evidence of their impact on the quality of life of RTW cancer survivors were identified.