Spatio-Temporal Information Extraction and Geoparsing for Public Chinese Resumes
Abstract
1. Introduction
2. Related Works
2.1. Information Extraction
2.2. Geoparsing
3. Materials and Methods
3.1. Data
3.2. Methodology
4. Resume Caption Lexicons Construction
4.1. Statistical-Based Caption Lexicon Construction
4.2. Text Similarity-Based Caption Lexicon Expansion
5. Resume Information Extraction Scheme
5.1. Resume Text Chunking
- Step 1: Caption trigger word targeting
- Step 2: Caption trigger word sorting
- Step 3: Resume text cutting
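The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the lexicon is a hypothetical, abbreviated subset of the caption lexicon, the function name `chunk_resume` is invented for this example, and only the first occurrence of each trigger word is considered.

```python
# Hypothetical, abbreviated lexicon: caption category -> trigger words.
CAPTION_LEXICON = {
    "Basic Information": ["Basic Information", "Personal Information"],
    "Work Experience": ["Work Experience", "Professional Experience"],
    "Education Experience": ["Educational Experience", "Educational Background"],
}


def chunk_resume(text, lexicon=CAPTION_LEXICON):
    """Split resume text into (category, content) chunks."""
    # Step 1: locate the first occurrence of every trigger word.
    hits = []
    for category, triggers in lexicon.items():
        for trigger in triggers:
            pos = text.find(trigger)
            if pos != -1:
                hits.append((pos, len(trigger), category))

    # Step 2: sort the hits by their position in the text.
    hits.sort()

    # Step 3: cut the text between consecutive trigger words.
    chunks = []
    for i, (pos, length, category) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        chunks.append((category, text[pos + length:end].strip()))
    return chunks


# Made-up resume text, used only to exercise the function.
sample = (
    "Basic Information\nName: Li Hua\n"
    "Work Experience\n2011.2-2011.8 Engineer\n"
    "Educational Background\n1993-1996 Wuhan University"
)
chunks = chunk_resume(sample)
```

Sorting before cutting matters because the lexicon is scanned by category, not in document order, so the raw hits do not yet reflect the order in which the captions appear.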
5.2. Resume Text Chunk Content Normalization
5.3. Resume Text Entity Recognition
5.3.1. Rule-Based Temporal Entity Recognition
- P1: re.compile(r'\d{4}'),
- P2: re.compile(r'(\d{2})/'),
- P3: re.compile(r'/(\d{2})'), and
- P4: re.compile(r'\d{2}').
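To see what each pattern captures, the sketch below probes a single endpoint of a time range (for example "06/2012" or "98/09") with all four patterns. The helper `probe` is hypothetical and only reports raw matches; how the matches are combined into a normalized date is not shown here and would be an assumption.

```python
import re

P1 = re.compile(r'\d{4}')      # four-digit year, e.g. "2012"
P2 = re.compile(r'(\d{2})/')   # two digits immediately before a slash
P3 = re.compile(r'/(\d{2})')   # two digits immediately after a slash
P4 = re.compile(r'\d{2}')      # first run of two digits


def probe(endpoint):
    """Return the first match of each pattern in one endpoint of a time range."""
    def first(pattern):
        match = pattern.search(endpoint)
        if match is None:
            return None
        # Patterns with a capturing group report the group, others the whole match.
        return match.group(1) if pattern.groups else match.group(0)

    return {"P1": first(P1), "P2": first(P2), "P3": first(P3), "P4": first(P4)}


full = probe("06/2012")     # P1 finds "2012"; P3 only sees the leading "20"
omitted = probe("98/09")    # no four-digit year, so only P2-P4 match
bare = probe("04")          # only P4 matches
```

Because P3 and P4 also fire on fragments of a four-digit year, a pipeline built on these patterns would plausibly try P1 first and fall back to the two-digit patterns only when it fails.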
5.3.2. Deep Learning-Based Entity Recognition
6. Spatial Entities Decoding
6.1. Spatial Entity Geocoding Based on Baidu Baike
Algorithm 1. Page parsing algorithm
- Step 1: Recognize the province and city in the current webpage title with the cpca library. If both the province and the city are found, go to Step 5; otherwise, go to Step 2.
- Step 2: Iterate through the basic information column of the webpage and identify the location content. If a province and city are obtained (the city may be missing), go to Step 5; otherwise, go to Step 3.
- Step 3: Use pattern matching to identify the location in the webpage paragraphs. If a province and city are obtained (the city may be missing), go to Step 5; otherwise, go to Step 4.
- Step 4: Segment the webpage overview and body content into words and build a dictionary of the form {key = city name, value = number of occurrences}. If the dictionary is not empty and the maximum number of occurrences is at least 5, take the most frequent city as the final result and look up its province; otherwise, leave the province and city empty. Go to Step 5.
- Step 5: Finish and return the province and city.
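The fallback cascade of Algorithm 1 can be sketched as below. Several parts are assumptions made for illustration: `recognize` is a hypothetical callable standing in for both the cpca lookup and the pattern matching, the gazetteer holds only two sample cities, and plain substring counting replaces the word segmentation used in Step 4.

```python
from collections import Counter

# Hypothetical mini gazetteer (city -> province); a real run needs a full one.
CITY_TO_PROVINCE = {"武汉市": "湖北省", "南昌市": "江西省"}
MIN_OCCURRENCES = 5  # Step 4 threshold stated in Algorithm 1


def most_frequent_city(body_text, gazetteer=CITY_TO_PROVINCE,
                       min_count=MIN_OCCURRENCES):
    """Step 4: return (province, city) for the most frequent city, or (None, None)."""
    counts = Counter({city: body_text.count(city) for city in gazetteer})
    city, n = counts.most_common(1)[0] if counts else (None, 0)
    if city is None or n < min_count:
        return None, None
    return gazetteer[city], city


def parse_page(title, info_box, paragraphs, body_text, recognize):
    """Steps 1-5: try each source in turn and stop at the first usable result.

    `recognize` is a hypothetical callable that returns (province, city),
    using None for any part it cannot identify.
    """
    province, city = recognize(title)        # Step 1: title must give both parts
    if province and city:
        return province, city
    for source in (info_box, paragraphs):    # Steps 2-3: the city may be missing
        province, city = recognize(source)
        if province:
            return province, city
    return most_frequent_city(body_text)     # Step 4, then Step 5 (return)


def never(_text):
    """Stub recognizer that forces the Step 4 fallback."""
    return None, None


result = parse_page("", "", "", "武汉市" * 5 + "南昌市" * 2, never)
```

The ordering reflects decreasing reliability of the sources: the title and information column are checked first, and the frequency count is only a last resort guarded by the occurrence threshold.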
6.2. Threshold Setting
7. Experimental Evaluation
7.1. Dataset
7.2. Evaluation Indexes
7.3. Recognition Results
7.4. Geocoding Results
8. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
References
- Zu, S.; Wang, X. Resume Information Extraction with A Novel Text Block Segmentation Algorithm. Int. J. Nat. Lang. Comput. 2019, 8, 29–48.
- Grishman, R. Twenty-five years of information extraction. Nat. Lang. Eng. 2019, 25, 677–692.
- Soderland, S. Learning information extraction rules for semi-structured and free text. Mach. Learn. 1999, 34, 233–272.
- Freitag, D.; McCallum, A. Information extraction with HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, Orlando, FL, USA, 18–19 July 1999; pp. 31–36.
- Yang, Y.; Wu, Z.; Yang, Y.; Lian, S.; Guo, F.; Wang, Z. A Survey of Information Extraction Based on Deep Learning. Appl. Sci. 2022, 12, 9691.
- Bharadwaj, R.; Mahajan, D.; Bharsakle, M.; Meshram, K.; Pujari, H. Resume analysis using NLP. In Inventive Systems and Control; Suma, V., Lorenz, P., Baig, Z., Eds.; Springer Nature: Singapore, 2023; pp. 551–561.
- Li, X.; Shu, H.; Guang, Y.; Zhai, Y.; Yang, Z. Survey of the Application of Natural Language Processing for Resume Analysis. Comput. Sci. 2022, 49, 66–73.
- Shen, K.; Huang, H.; Hua, B. Constructing Knowledge Graph with Public Resumes. Data Anal. Knowl. Discov. 2021, 5, 81–90.
- Tao, L.; Xie, Z.; Xu, D.; Ma, K.; Qiu, Q.; Pan, S.; Huang, B. Geographic Named Entity Recognition by Employing Natural Language Processing and an Improved BERT Model. ISPRS Int. J. Geo-Inf. 2022, 11, 598.
- Gritta, M.; Pilehvar, M.T.; Limsopatham, N.; Collier, N. What’s missing in geographical parsing? Lang. Resour. Eval. 2018, 52, 603–623.
- Ciravegna, F.; Lavelli, A. LearningPinocchio: Adaptive information extraction for real world applications. Nat. Lang. Eng. 2004, 10, 145–165.
- Kopparapu, S.K. Automatic extraction of usable information from unstructured resumes to aid search. In Proceedings of the 2010 IEEE International Conference on Progress in Informatics and Computing, Shanghai, China, 10–12 December 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 99–103.
- Gaur, B.; Saluja, G.S.; Sivakumar, H.B.; Singh, S. Semi-supervised deep learning based named entity recognition model to parse education section of resumes. Neural Comput. Appl. 2021, 33, 5705–5718.
- Qiao, L.; Li, C.; Zhong, Z.; Wang, J.; Liu, D. Research on People’s Information Extraction Based on Rules. J. Nanjing Norm. Univ. (Nat. Sci. Ed.) 2012, 35, 134–139.
- Li, H.; Yang, Y.; Yin, H. Research on character attributes extraction based on rules from Baidu encyclopedia. J. Integr. Technol. 2013, 2, 1–4.
- Yu, D.; Liu, C.; Tian, Y. Personal title and career attributes extraction based on distant supervision and pattern matching. J. Comput. Appl. 2016, 36, 455–459.
- Dong, F.; Wang, J. Personal Information Extraction of the Teaching Staff Based on CRFs. In Proceedings of the International Conference on Network & Information Systems for Computers, Wuhan, China, 23–25 January 2015; IEEE: Piscataway, NJ, USA, 2015.
- Chen, J.; Zhang, C.; Niu, Z. A two-step resume information extraction algorithm. Math. Probl. Eng. 2018, 2018, 5761287.
- Yang, Y.; Bai, Y.; Cai, D.; He, J. Information extraction for resumes of scientific and technological figures. Comput. Eng. Des. 2021, 42, 3099–3106.
- Guo, J.; Wan, G.; Hu, X.; Wei, Z. Chinese resume named entity recognition based on BERT. J. Comput. Appl. 2021, 41, 15–19.
- Lin, J.; Cao, D.; Yuan, C. Automatic TIMEX2 tagging of Chinese temporal information. J. Tsinghua Univ. (Sci. Technol.) 2008, 48, 117–120.
- Wu, T.; Zhou, Y.; Huang, X.; Wu, L. Chinese time expression recognition based on automatically generated basic-time-unit rules. J. Chin. Inf. Process. 2010, 24, 3–10.
- Wen, Y.; Tan, H.; Zheng, J. Research on time standardization based on rules. In Proceedings of the 2009 International Information Technology and Applications Forum, Chengdu, China, 15–17 May 2009; pp. 49–51.
- Zhang, C.J.; Zhang, X.Y.; Li, M.; Wang, S. Interpretation of temporal information in Chinese text. Geogr. Geo-Inf. Sci. 2014, 30, 1–7.
- Qiu, Q.; Xie, Z.; Wu, L.; Tao, L. Automatic spatiotemporal and semantic information extraction from unstructured geoscience reports using text mining techniques. Earth Sci. Inform. 2020, 13, 1393–1410.
- Wu, L.; Liu, L.; Li, H.; Gao, Y. A Chinese Toponym Recognition Method Based on Conditional Random Field. Geomat. Inf. Sci. Wuhan Univ. 2017, 42, 150–156.
- Mao, P.; Teng, W. Complex Chinese place name recognition based on conditional random field and rule improvement. Eng. J. Wuhan Univ. 2020, 53, 447–454.
- Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991.
- Xu, L.; Dong, Q.; Liao, Y.; Yu, C.; Tian, Y.; Liu, W.; Li, L.; Liu, C.; Zhang, X. CLUENER2020: Fine-grained named entity recognition dataset and benchmark for Chinese. arXiv 2020, arXiv:2001.04351.
- Gelernter, J.; Balaji, S. An algorithm for local geoparsing of microtext. GeoInformatica 2013, 17, 635–667.
- Liu, X.; Li, Y.; Yin, B.; Tian, Q. Chinese address understanding by integrating neural network and spatial relationship. Sci. Surv. Mapp. 2021, 46, 165–171+212.
- Zhang, H.; Du, Q.; Chen, Z.; Zhang, C. A Chinese Address Parsing Method Using RoBERTa-BiLSTM-CRF. Geomat. Inf. Sci. Wuhan Univ. 2022, 47, 665–672.
- He, C.; Wan, Y. Optimization and Application of Online Multi-source Geocoding Fusion. Geospat. Inf. 2023, 21, 45–47+116.
- Zhu, X. Comparison of geocoding errors for community addresses and road addresses. Jiangsu Sci. Technol. Inf. 2022, 39, 70–75.
- Yan, W. Information Extraction for Semi-Structured Chinese Resume. Master’s Thesis, South China University of Technology, Guangzhou, China, 2018.
- Chen, E.; Jiang, E. Review of Studies on Text Similarity Measures. Data Anal. Knowl. Discov. 2017, 1, 1–11.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
- Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166.
- Li, S.; Zhao, Z.; Hu, R.; Li, W.; Liu, T.; Du, X. Analogical Reasoning on Chinese Morphological and Semantic Relations. arXiv 2018, arXiv:1805.06504.
- Zhang, Y.; Yang, J. Chinese NER Using Lattice LSTM. arXiv 2018, arXiv:1805.02023.
Caption Category | Caption Lexicon |
---|---|
Basic Information | “Basic Information”, “Personal Introduction”, “Personal Information” |
Personal Resume | “Personal Resume”, “Curriculum Vitae”, “Study and Work Experience” |
Work Experience | “Work Experience”, “Professional Experience”, “Job Resume” |
Education Experience | “Educational Experience”, “Educational Background”, “Study Experience”, “Academic Experience” |
Research Areas | “Research Areas”, “Research Content”, “Research Direction” |
Teaching and Research | “Lecture Courses”, “Teaching Course”, “Teaching Situation”, “Teaching”, “Teaching and Scientific Research”, “Teaching Curriculum” |
Awards and Honors | “Scholastic Honor”, “Awards and Honors” |
Tenure and Part-Time | “Adjunct Research Position”, “Social Position”, “Academic Participation” |
Scientific Research | “Scientific Research”, “Research Projects”, “Hosting Projects”, “Research Summary” |
Thesis Results | “Published Papers”, “Books and Papers”, “Representative Researches”, “Research Results”, “Representative Papers”, “Academic Works” |
Contact Details | “Contact Details” |
Rules | Operations |
---|---|
Time | Merge with next line |
End with punctuation | Merge with next line |
Begin with punctuation | Merge with previous line |
Short text | Merge with previous line |
Plain English text | Merge with previous line |
Type of Time Description | Examples |
---|---|
Full description | 06/2012–11/2017; 2011.2–2011.8; 2017/07–2020/10; 07/2002–08/2005; 1993–1996 |
Omitted description | 98/09–01/06; 09/94–07/97; 04–13 |
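Omitted descriptions such as 04–13 leave out the century, so it has to be restored before the range can be used. The sketch below shows one simple way to do this, under the assumption that no resume date lies after a cutoff year; the cutoff of 2023 and the function names are choices made for this example, not values taken from the method.

```python
def expand_year(two_digits, latest_year=2023):
    """Expand a two-digit year, assuming no date lies after latest_year."""
    year = 2000 + int(two_digits)
    # A year in the future cannot be right, so fall back to the previous century.
    return year if year <= latest_year else year - 100


def expand_year_range(text, latest_year=2023):
    """Expand a year-only omitted range such as '04–13' (en dash separator)."""
    start, end = text.split("–")
    return expand_year(start, latest_year), expand_year(end, latest_year)


span = expand_year_range("04–13")
```

Here `span` is `(2004, 2013)`, while `expand_year("98")` gives 1998. Ranges that also carry months, such as 98/09–01/06, would additionally need the year and month positions to be told apart, which this sketch does not attempt.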
Method | Description of the Threshold Settings |
---|---|
Baidu Baike Search | After stopwords are removed, a related entry is treated as a candidate when the edit distance between the retrieved entity and the entry title is within 5, or when their character-level cosine similarity [39] is greater than or equal to 0.9. The candidate is considered to be the same entity as the retrieved entity if the entity appears in the overview paragraph of the entry or if the cosine similarity is greater than or equal to 0.95. |
Baidu Map API Query | After stopwords are removed from the POI names returned by the Baidu Map Location Retrieval Service API, a returned POI is considered to be the same entity as the retrieved entity if their edit distance is within 2 or their cosine similarity is greater than or equal to 0.95. |
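The two threshold rules can be sketched as follows. This is an illustrative reading of the table, not the authors' code: "within" is interpreted as less than or equal to, stopword removal is assumed to have been applied to the inputs already, and the function names are invented for this example.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def char_cosine(a, b):
    """Cosine similarity between the character-count vectors of two strings."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm = (math.sqrt(sum(n * n for n in va.values()))
            * math.sqrt(sum(n * n for n in vb.values())))
    return dot / norm if norm else 0.0


def same_baike_entry(entity, title, overview):
    """Baidu Baike rule: loose candidate test, then a stricter confirmation."""
    cosine = char_cosine(entity, title)
    is_candidate = edit_distance(entity, title) <= 5 or cosine >= 0.9
    return is_candidate and (entity in overview or cosine >= 0.95)


def same_poi(entity, poi_name):
    """Baidu Map rule: a single test with tighter thresholds."""
    return (edit_distance(entity, poi_name) <= 2
            or char_cosine(entity, poi_name) >= 0.95)
```

Note that character-count cosine similarity ignores character order, so two names built from the same characters in a different order score close to 1; pairing it with the edit distance, as the table does, gives a second and order-sensitive signal.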
Type of Entity | Precision/% | Recall/% | F1 Value/% |
---|---|---|---|
Nationality | 100 | 100 | 100 |
Education | 99.11 | 99.11 | 99.11 |
Native place/location | 100 | 100 | 100 |
Name | 100 | 100 | 100 |
Organization | 94.07 | 94.65 | 94.36 |
Profession | 97.06 | 100 | 98.51 |
Ethnicity | 100 | 100 | 100 |
Position | 96.39 | 94.47 | 95.42 |
Models | Precision/% | Recall/% | F1 Value/% |
---|---|---|---|
BiLSTM | 90.69 | 87.30 | 88.97 |
BiLSTM-CRF | 91.47 | 88.77 | 90.10 |
BERT-CRF | 90.10 | 95.70 | 95.63 |
BERT-BiLSTM-CRF | 96.23 | 95.63 | 95.93 |
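The F1 values in these tables can be checked against precision and recall, assuming the usual definition of F1 as their harmonic mean. The function name below is chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


bilstm_crf = round(f1_score(91.47, 88.77), 2)        # table reports 90.10
bert_bilstm_crf = round(f1_score(96.23, 95.63), 2)   # table reports 95.93
```

Both rows reproduce the reported values. The BERT-CRF row does not: a precision of 90.10 and a recall of 95.70 would give an F1 of about 92.82 rather than 95.63, so one of the three figures in that row may have been transcribed incorrectly.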
Methods | Precision/% | Recall/% | F1 Value/% |
---|---|---|---|
Spatial entity geocoding | 97.91 | 96.35 | 97.12 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Li, X.; Zhang, W.; Wang, Y.; Tan, Y.; Xia, J. Spatio-Temporal Information Extraction and Geoparsing for Public Chinese Resumes. ISPRS Int. J. Geo-Inf. 2023, 12, 377. https://doi.org/10.3390/ijgi12090377