Zero-Inflated Text Data Analysis using Generative Adversarial Networks and Statistical Modeling
Abstract
1. Introduction
2. Zero-Inflated Data Analysis
3. Proposed Method
- (Step 1) Collecting patent documents.
- (Step 2) Patent text data import and export.
  - (2-1) Corpus: collection of patent documents.
  - (2-2) Transformation: stemming, stop-word removal, etc.
- (Step 3) Create document–keyword matrix.
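
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: the three sample documents, the tiny stop-word list, and the crude suffix-stripping `stem` helper are all hypothetical stand-ins for a real patent corpus and a real stemmer.

```python
import re
from collections import Counter

# Hypothetical toy corpus standing in for the collected patent documents (Step 1).
corpus = [
    "A display device displaying virtual images to the user.",
    "The sensor detects the user and the system renders a scene.",
    "Virtual reality systems and augmented reality displays.",
]

# Assumed, deliberately tiny stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of"}


def stem(token):
    """Crude suffix stripping for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Step 2-2: lowercase, tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def document_keyword_matrix(docs):
    """Step 3: rows are documents, columns are keywords, cells are counts."""
    counts = [Counter(preprocess(d)) for d in docs]
    keywords = sorted(set().union(*counts))
    matrix = [[c[k] for k in keywords] for c in counts]
    return keywords, matrix


keywords, matrix = document_keyword_matrix(corpus)
for row in matrix:
    print(row)
```

Even in this toy example most cells are zero, because each document uses only a few of the keywords. That is the zero-inflation the paper is concerned with.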
4. Experimental Results
4.1. Patent Document Data Analysis
4.2. Simulation Data Analysis
5. Discussion
6. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
Keyword | # of Zeros | Sparsity (%) | Keyword | # of Zeros | Sparsity (%) |
---|---|---|---|---|---|
wall | 2860 | 95.33 | interact | 2770 | 92.33 |
visual | 2779 | 92.63 | inform | 2636 | 87.87 |
virtual | 2101 | 70.03 | image | 2338 | 77.93 |
view | 2626 | 87.53 | head | 2771 | 92.37 |
video | 2770 | 92.33 | generate | 2410 | 80.33 |
user | 2135 | 71.17 | face | 2859 | 95.30 |
time | 2773 | 92.43 | eye | 2852 | 95.07 |
system | 1759 | 58.63 | extend | 1532 | 51.07 |
surface | 2540 | 84.67 | environment | 2503 | 83.43 |
structure | 2743 | 91.43 | electric | 2834 | 94.47 |
space | 2691 | 89.70 | edge | 2857 | 95.23 |
signal | 2761 | 92.03 | display | 2126 | 70.87 |
sensor | 2698 | 89.93 | device | 1880 | 62.67 |
scene | 2873 | 95.77 | detect | 2723 | 90.77 |
rotate | 2795 | 93.17 | data | 2490 | 83.00 |
render | 2811 | 93.70 | control | 2531 | 84.37 |
region | 2793 | 93.10 | content | 2735 | 91.17 |
receive | 2453 | 81.77 | contact | 2835 | 94.50 |
reality | 1873 | 62.43 | connect | 2655 | 88.50 |
present | 2466 | 82.20 | configure | 2334 | 77.80 |
posit | 2373 | 79.10 | compute | 2619 | 87.30 |
physic | 2788 | 92.93 | component | 2810 | 93.67 |
optic | 2757 | 91.90 | communication | 2738 | 91.27 |
object | 2542 | 84.73 | capture | 2750 | 91.67 |
move | 2791 | 93.03 | camera | 2775 | 92.50 |
mobile | 2880 | 96.00 | augment | 2419 | 80.63 |
map | 2863 | 95.43 | associate | 2654 | 88.47 |
light | 2747 | 91.57 | assemble | 2812 | 93.73 |
layer | 2837 | 94.57 | arrange | 2720 | 90.67 |
interface | 2811 | 93.70 | andor | 2678 | 89.27 |
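
The sparsity column in the table above is the share of documents in which a keyword has a zero count. The sketch below shows that calculation on a hypothetical column, then cross-checks three table rows. The document count of 3000 is an assumption: it is not stated in this section but is inferred from the table itself (2860 zeros corresponds to 95.33%).

```python
def sparsity_percent(column):
    """Share of zero entries in a count column, as a percentage."""
    zeros = sum(1 for v in column if v == 0)
    return 100.0 * zeros / len(column)


# Hypothetical keyword column: 8 documents, 6 of which never use the keyword.
toy_column = [0, 0, 3, 0, 0, 1, 0, 0]
print(sparsity_percent(toy_column))

# Cross-check against the table, assuming 3000 documents (inferred, not stated).
N_DOCS = 3000
table_rows = [("wall", 2860), ("system", 1759), ("extend", 1532)]
recomputed = {kw: round(100.0 * zeros / N_DOCS, 2) for kw, zeros in table_rows}
print(recomputed)
```

Under that assumption the recomputed values agree with the reported 95.33, 58.63 and 51.07.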
Evaluation Measure | Original Data | Noise-Added Data | Synthetic Data |
---|---|---|---|
PRESS | 3117.086 | 3640.162 | 114.242 |
R-squared | 0.4283 | 0.3077 | 0.9755 |
Log-likelihood | −2841.364 | −4529.54 | 658.551 |
AIC | 5696.728 | 9075.08 | −1301.102 |
BIC | 5738.773 | 9123.131 | −1253.051 |
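
The AIC and BIC rows in the table above follow from the log-likelihood row through the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L. The sketch below recomputes them. The parameter counts k and the sample size n = 3000 are assumptions: neither is given in this section, and the values used here were chosen only because they reproduce the reported figures to the printed precision.

```python
import math


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2 * log_lik


N_OBS = 3000  # assumption: inferred from the table, not stated in the text

# column name -> (log-likelihood from the table, assumed parameter count k)
columns = {
    "Original Data": (-2841.364, 7),
    "Noise-Added Data": (-4529.54, 8),
    "Synthetic Data": (658.551, 8),
}

for name, (log_lik, k) in columns.items():
    print(name, round(aic(log_lik, k), 3), round(bic(log_lik, k, N_OBS), 3))
```

Lower AIC and BIC indicate a better trade-off between fit and model size, which is why the synthetic-data column is the preferred one in this table.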
Variable | Distribution | Parameter |
---|---|---|
Predictor (X1) | Poisson | mean = 0.15 |
Predictor (X2) | Poisson | mean = 0.21 |
Predictor (X3) | Poisson | mean = 0.12 |
Predictor (X4) | Poisson | mean = 1.22 |
Predictor (X5) | Poisson | mean = 0.88 |
Response (Y) | Normal | mean = 0, variance = 1 |
Variable | X1 | X2 | X3 | X4 | X5 | Y |
---|---|---|---|---|---|---|
X1 | 1.00 | 0.42 | 0.35 | 0.25 | 0.09 | 0.14 |
X2 | 0.42 | 1.00 | 0.12 | 0.29 | −0.22 | 0.19 |
X3 | 0.35 | 0.12 | 1.00 | 0.46 | −0.14 | 0.13 |
X4 | 0.25 | 0.29 | 0.46 | 1.00 | 0.10 | 0.36 |
X5 | 0.09 | −0.22 | −0.14 | 0.10 | 1.00 | 0.58 |
Y | 0.14 | 0.19 | 0.13 | 0.36 | 0.58 | 1.00 |
Variable | Poisson Parameter | # of Zeros | Sparsity (%) |
---|---|---|---|
X1 | 0.15 | 4288 | 85.76 |
X2 | 0.21 | 4066 | 81.32 |
X3 | 0.12 | 4462 | 89.24 |
X4 | 1.22 | 1474 | 29.48 |
X5 | 0.88 | 2050 | 41.00 |
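
For a Poisson variable with mean λ, the probability of a zero is exp(−λ), so the expected sparsity can be read off the Poisson parameter directly. The sketch below compares that expectation with the observed sparsity in the table above. The sample size of 5000 is an assumption inferred from the table (4288 zeros corresponds to 85.76%); it is not stated in this section.

```python
import math

N_SAMPLES = 5000  # assumption: inferred from the table, not stated in the text

# variable -> (Poisson mean, observed number of zeros), copied from the table
simulated = {
    "X1": (0.15, 4288),
    "X2": (0.21, 4066),
    "X3": (0.12, 4462),
    "X4": (1.22, 1474),
    "X5": (0.88, 2050),
}


def expected_sparsity(lam):
    """Expected percentage of zeros for a Poisson variable with mean lam."""
    return 100.0 * math.exp(-lam)


gaps = {}
for name, (lam, zeros) in simulated.items():
    observed = 100.0 * zeros / N_SAMPLES
    gaps[name] = observed - expected_sparsity(lam)
    print(name, round(expected_sparsity(lam), 2), round(observed, 2))
```

Under these assumptions each observed value lies within one percentage point of exp(−λ), so the zeros in the simulated predictors are about what plain Poisson sampling with small means would produce.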
Evaluation Measure | Original Data | Noise-Added Data | Synthetic Data |
---|---|---|---|
PRESS | 2836.303 | 3943.247 | 142.9447 |
R-squared | 0.4293 | 0.2066 | 0.8305 |
Log-likelihood | −5671.349 | −6495.143 | −437.0238 |
AIC | 11,356.7 | 13,004.29 | 888.0477 |
BIC | 11,402.32 | 13,049.91 | 922.402 |
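
PRESS, the first measure in the table above, is the sum of squared leave-one-out prediction errors. For least squares it can be computed without refitting, as the sum of (e_i / (1 − h_ii))², where e_i is the ordinary residual and h_ii the leverage. The sketch below is a simplified illustration with a single predictor and hypothetical toy data; the models behind the table use several predictors, and this is not the paper's code.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def r_squared(x, y):
    """Coefficient of determination of the fitted line."""
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def press(x, y):
    """PRESS via the leverage shortcut: sum of (e_i / (1 - h_ii))^2."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    total = 0.0
    for xi, yi in zip(x, y):
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx
        residual = yi - (b0 + b1 * xi)
        total += (residual / (1.0 - leverage)) ** 2
    return total


def press_brute_force(x, y):
    """Same quantity, obtained by refitting with each observation left out."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


# Hypothetical zero-heavy predictor and a continuous response.
x = [0, 0, 0, 1, 0, 2, 0, 1, 0, 3]
y = [0.1, -0.2, 0.3, 1.1, 0.0, 1.8, -0.1, 0.9, 0.2, 3.2]

print(press(x, y), press_brute_force(x, y), r_squared(x, y))
```

The two PRESS functions should agree up to floating-point error. A smaller PRESS means better out-of-sample prediction, which is how the three columns of the table are compared.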
© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jun, S. Zero-Inflated Text Data Analysis using Generative Adversarial Networks and Statistical Modeling. Computers 2023, 12, 258. https://doi.org/10.3390/computers12120258