Informatics, Volume 10, Issue 4 (December 2023) – 15 articles

Cover Story:

The Health and Aging Brain Study–Health Disparities (HABS–HD) aims to understand factors impacting brain aging in diverse communities. A critical challenge is missing data, hindering accurate machine learning (ML). Common imputation methods may lead to biased outcomes. Thus, developing a new imputation methodology has become an urgent task for HABS–HD.

We devised a three-step workflow: (1) evaluating missing data; (2) ML-based multiple imputation; and (3) imputation evaluation. At its core is the ML-based multiple imputation method (MLMI).

MLMI excelled, delivering superior prediction while preserving variable distributions and correlations. The workflow handles missing values in HABS–HD robustly, especially in Alzheimer's disease models, and is applicable to other disease data analyses.

  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the tables of contents of newly released issues.
  • PDF is the official format for papers, which are published in both HTML and PDF forms. To view a paper in PDF format, click the "PDF Full-text" link and open it with the free Adobe Reader.
15 pages, 5438 KiB  
Article
EndoNet: A Model for the Automatic Calculation of H-Score on Histological Slides
by Egor Ushakov, Anton Naumov, Vladislav Fomberg, Polina Vishnyakova, Aleksandra Asaturova, Alina Badlaeva, Anna Tregubova, Evgeny Karpulevich, Gennady Sukhikh and Timur Fatkhudinov
Informatics 2023, 10(4), 90; https://doi.org/10.3390/informatics10040090 - 12 Dec 2023
Cited by 1 | Viewed by 2128
Abstract
H-score is a semi-quantitative method used to assess the presence and distribution of proteins in tissue samples by combining the intensity of staining and the percentage of stained nuclei. It is widely used but time-consuming and can be limited in terms of accuracy and precision. Computer-aided methods may help overcome these limitations and improve the efficiency of pathologists’ workflows. In this work, we developed EndoNet, a model for automatic H-score calculation on histological slides. Our proposed method uses neural networks and consists of two main parts. The first is a detection model that predicts keypoints at the centers of nuclei. The second is an H-score module that calculates the H-score from the mean pixel values at the predicted keypoints. Our model was trained and validated on 1780 annotated tiles measuring 100 × 100 µm, achieving 0.77 mAP on a test dataset. Our best results in H-score calculation proved superior to QuPath predictions. Moreover, the model can be adapted to a specific specialist or an entire laboratory to reproduce their manner of calculating the H-score. Thus, EndoNet is effective and robust in the analysis of histology slides and can significantly improve and accelerate the work of pathologists. Full article
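The H-score arithmetic that EndoNet automates follows the standard formula combining staining-intensity bins and the percentage of nuclei in each bin. A minimal sketch of that final calculation, assuming nuclei have already been detected and binned (the detection network itself is not reproduced here):

```python
# Minimal sketch of the standard H-score formula (range 0-300), assuming each
# detected nucleus has already been assigned a staining-intensity bin.
# This is NOT the EndoNet model; it only illustrates the final arithmetic.

def h_score(n_negative: int, n_weak: int, n_moderate: int, n_strong: int) -> float:
    """H-score = 1*(% weak) + 2*(% moderate) + 3*(% strong)."""
    total = n_negative + n_weak + n_moderate + n_strong
    if total == 0:
        raise ValueError("no nuclei detected")
    pct = lambda n: 100.0 * n / total
    return 1 * pct(n_weak) + 2 * pct(n_moderate) + 3 * pct(n_strong)

# Example: 40 negative, 30 weak, 20 moderate, 10 strong nuclei -> H-score 100
print(h_score(40, 30, 20, 10))
```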

20 pages, 3244 KiB  
Article
Knowledge-Based Intelligent Text Simplification for Biological Relation Extraction
by Jaskaran Gill, Madhu Chetty, Suryani Lim and Jennifer Hallinan
Informatics 2023, 10(4), 89; https://doi.org/10.3390/informatics10040089 - 11 Dec 2023
Viewed by 2003
Abstract
Relation extraction from biological publications plays a pivotal role in accelerating scientific discovery and advancing medical research. While vast amounts of this knowledge are stored within the published literature, extracting it manually from the continually growing volume of documents is becoming increasingly arduous. Recently, attention has turned to automatically extracting such knowledge using pre-trained large language models (LLMs) and deep-learning algorithms. However, the complex syntactic structure of biological sentences, with nested entities and domain-specific terminology, together with insufficient annotated training corpora, poses major challenges to accurately capturing entity relationships from unstructured data. To address these issues, we propose a Knowledge-based Intelligent Text Simplification (KITS) approach focused on the accurate extraction of biological relations. KITS precisely captures the relational context among the binary relations within a sentence while preventing any change in meaning for the sentences it simplifies. The experiments show that, measured with well-known performance metrics, the proposed technique yielded a 21% increase in precision with only 25% of sentences simplified in the Learning Language in Logic (LLL) dataset. Combined with BioBERT, a popular pre-trained LLM, the proposed method was able to outperform other state-of-the-art methods. Full article

18 pages, 876 KiB  
Article
Unraveling Microblog Sentiment Dynamics: A Twitter Public Attitudes Analysis towards COVID-19 Cases and Deaths
by Paraskevas Koukaras, Dimitrios Rousidis and Christos Tjortjis
Informatics 2023, 10(4), 88; https://doi.org/10.3390/informatics10040088 - 7 Dec 2023
Viewed by 1597
Abstract
The identification and analysis of sentiment polarity in microblog data has drawn increased attention. Researchers and practitioners attempt to extract knowledge by evaluating public sentiment in response to global events. This study aimed to evaluate public attitudes towards the spread of COVID-19 by performing sentiment analysis on over 2.1 million tweets in English. The implications included the generation of insights for timely disease outbreak prediction and assertions regarding worldwide events, which can help policymakers take suitable actions. We investigated whether there was a correlation between public sentiment and the number of cases and deaths attributed to COVID-19. The research design integrated text preprocessing (regular expression operations, (de)tokenization, stopwords), sentiment polarization analysis via TextBlob, hypothesis formulation (null hypothesis testing), and statistical analysis (Pearson coefficient and p-value) to produce the results. The key findings highlight a correlation between sentiment polarity and deaths, starting at 41 days before and expanding up to 3 days after counting. Twitter users reacted to increased numbers of COVID-19-related deaths after four days by posting tweets with fading sentiment polarization. We also detected a strong correlation between COVID-19 Twitter conversation polarity and reported cases and a weak correlation between polarity and reported deaths. Full article
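The core statistical step, correlating daily sentiment polarity with reported counts at several time lags, can be sketched roughly as follows; the dataframe columns and helper names are illustrative assumptions, not the authors' code:

```python
# Minimal sketch: daily mean tweet polarity (TextBlob) correlated with reported
# deaths at a range of time lags. Column names and input dataframes are assumed
# for illustration only, not taken from the paper.
import pandas as pd
from textblob import TextBlob
from scipy.stats import pearsonr

def daily_polarity(tweets: pd.DataFrame) -> pd.Series:
    """tweets: columns ['date', 'text'] -> mean polarity per day."""
    tweets = tweets.copy()
    tweets["polarity"] = tweets["text"].map(lambda t: TextBlob(t).sentiment.polarity)
    return tweets.groupby("date")["polarity"].mean()

def lagged_correlations(polarity: pd.Series, deaths: pd.Series, max_lag: int = 41):
    """Pearson r (and p-value) between polarity and deaths shifted by each lag."""
    results = {}
    for lag in range(-max_lag, max_lag + 1):
        aligned = pd.concat(
            [polarity, deaths.shift(lag)], axis=1, keys=["pol", "deaths"]
        ).dropna()
        if len(aligned) > 2:
            r, p = pearsonr(aligned["pol"], aligned["deaths"])
            results[lag] = (r, p)
    return results
```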

14 pages, 960 KiB  
Article
ChatGPT in Education: Empowering Educators through Methods for Recognition and Assessment
by Joost C. F. de Winter, Dimitra Dodou and Arno H. A. Stienen
Informatics 2023, 10(4), 87; https://doi.org/10.3390/informatics10040087 - 29 Nov 2023
Cited by 11 | Viewed by 5226
Abstract
ChatGPT is widely used among students, a situation that challenges educators. The current paper presents two strategies that do not push educators into a defensive role but can empower them. Firstly, we show, based on statistical analysis, that ChatGPT use can be recognized from certain keywords such as ‘delves’ and ‘crucial’. This insight allows educators to detect ChatGPT-assisted work more effectively. Secondly, we illustrate that ChatGPT can be used to assess texts written by students. The latter topic was presented in two interactive workshops provided to educators and educational specialists. The results of the workshops, where prompts were tested live, indicated that ChatGPT, provided a targeted prompt is used, is good at recognizing errors in texts but not consistent in grading. Ethical and copyright concerns were raised as well in the workshops. In conclusion, the methods presented in this paper may help fortify the teaching methods of educators. The computer scripts that we used for live prompting are available and enable educators to give similar workshops. Full article
(This article belongs to the Topic AI Chatbots: Threat or Opportunity?)
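A purely illustrative sketch of the keyword-frequency idea: the marker words 'delves' and 'crucial' come from the abstract, but the additional markers, baseline rate, and flagging threshold below are hypothetical placeholders rather than the paper's statistics:

```python
# Illustrative sketch: flag a text whose rate of ChatGPT-associated marker words
# greatly exceeds a human-writing baseline. The baseline rate and threshold are
# made-up placeholders, not values from the paper.
import re

MARKERS = {"delve", "delves", "crucial"}  # 'delves'/'crucial' per the abstract
BASELINE_RATE = 0.0005   # assumed marker words per token in human-written text
FLAG_RATIO = 5.0         # assumed: flag if observed rate is 5x the baseline

def marker_rate(text: str) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in MARKERS)
    return hits / len(tokens)

def looks_llm_assisted(text: str) -> bool:
    return marker_rate(text) > FLAG_RATIO * BASELINE_RATE
```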

17 pages, 1986 KiB  
Article
Automated Detection of Persuasive Content in Electronic News
by Brian Rizqi Paradisiaca Darnoto, Daniel Siahaan and Diana Purwitasari
Informatics 2023, 10(4), 86; https://doi.org/10.3390/informatics10040086 - 21 Nov 2023
Viewed by 2250
Abstract
Persuasive content in online news contains elements that aim to persuade readers and may not necessarily include factual information. Since only some sentences in a news article indicate persuasiveness, it is quite challenging to differentiate news with and without persuasive content. Recognizing persuasive sentences with a text summarization and classification approach is important for understanding persuasive messages effectively. Text summarization identifies arguments and key points, while classification separates persuasive sentences based on the linguistic and semantic features used. Our proposed architecture uses text summarization to trim away sentences without persuasive content and then classification models to detect those with indications of persuasion. In this paper, we compare the performance of two text summarization methods, latent semantic analysis (LSA) and TextRank, the latter of which outperformed the former in all trials, as well as two classifiers, a convolutional neural network (CNN) and a bidirectional long short-term memory (BiLSTM) network. We prepared a dataset of approximately 1700 news articles written in the Indonesian language, manually labeled for persuasiveness and collected from a nationwide electronic news portal. Comparative studies in our experimental results show that the TextRank–BERT–BiLSTM model achieved the highest accuracy of 95% in detecting persuasive news. The text summarization methods generated detailed and precise summaries of the news articles, and the deep learning models effectively differentiated between persuasive news and real news. Full article
(This article belongs to the Section Machine Learning)
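A minimal TextRank-style extractive summarizer, sketching only the summarization stage of such a pipeline; the BERT embeddings, BiLSTM classifier, and Indonesian-language preprocessing of the paper are not reproduced:

```python
# Minimal TextRank-style extractive summarization: rank sentences by PageRank
# over a TF-IDF cosine-similarity graph and keep the top-k. This only sketches
# the summarization stage, not the paper's BERT-BiLSTM persuasion classifier.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences: list[str], k: int = 3) -> list[str]:
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)          # sentence-by-sentence similarity
    graph = nx.from_numpy_array(sim)        # weighted similarity graph
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original sentence order
```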

14 pages, 885 KiB  
Article
Why Do People Use Telemedicine Apps in the Post-COVID-19 Era? Expanded TAM with E-Health Literacy and Social Influence
by Moonkyoung Jang
Informatics 2023, 10(4), 85; https://doi.org/10.3390/informatics10040085 - 6 Nov 2023
Cited by 1 | Viewed by 2398
Abstract
This study delves into the determinants influencing individuals’ intentions to adopt telemedicine apps during the COVID-19 pandemic. The study aims to offer a comprehensive framework for understanding behavioral intentions by leveraging the Technology Acceptance Model (TAM), supplemented by e-health literacy and social influence variables. The study analyzes survey data from 364 adults using partial least squares structural equation modeling (PLS-SEM) to empirically examine the internal relationships within the model. Results indicated that e-health literacy, attitude, and social influence significantly impacted the intention to use telemedicine apps. Notably, e-health literacy positively influenced both perceived usefulness and perceived ease of use, extending its effect beyond mere usage intention. The study underscored the substantial role of social influence in predicting the intention to use telemedicine apps, challenging the traditional oversight of social influence in the TAM framework. The findings help researchers, practitioners, and governments understand how social influence and e-health literacy shape the adoption of telehealth apps, and suggest that strengthening both factors can promote their use. Full article

24 pages, 3743 KiB  
Article
Classifying Crowdsourced Citizen Complaints through Data Mining: Accuracy Testing of k-Nearest Neighbors, Random Forest, Support Vector Machine, and AdaBoost
by Evaristus D. Madyatmadja, Corinthias P. M. Sianipar, Cristofer Wijaya and David J. M. Sembiring
Informatics 2023, 10(4), 84; https://doi.org/10.3390/informatics10040084 - 1 Nov 2023
Cited by 3 | Viewed by 2354
Abstract
Crowdsourcing has gradually become an effective e-government process to gather citizen complaints over the implementation of various public services. In practice, the collected complaints form a massive dataset, making it difficult for government officers to analyze the big data effectively. It is consequently vital to use data mining algorithms to classify the citizen complaint data for efficient follow-up actions. However, different classification algorithms produce varied classification accuracies. Thus, this study aimed to compare the accuracy of several classification algorithms on crowdsourced citizen complaint data. Taking the case of the LAKSA app in Tangerang City, Indonesia, this study included k-Nearest Neighbors, Random Forest, Support Vector Machine, and AdaBoost for the accuracy assessment. The data were taken from crowdsourced citizen complaints submitted to the LAKSA app, including those aggregated from official social media channels, from May 2021 to April 2022. The results showed SVM with a linear kernel to be the most accurate among the assessed algorithms (89.2%). In contrast, AdaBoost (base learner: Decision Trees) produced the lowest accuracy. Still, the accuracy levels of all algorithms varied in parallel with the amount of training data available for the actual classification categories. Overall, the assessments indicated that the algorithms' accuracies did not differ substantially, with an overall variation of 4.3%. The AdaBoost-based classification, in particular, showed a strong dependence on the choice of base learner. Looking at the method and results, this study contributes to the e-government, data mining, and big data discourses. This research recommends that governments continuously conduct supervised training of classification algorithms on their crowdsourced citizen complaints to seek the highest accuracy possible, paving the way for smart and sustainable governance. Full article
(This article belongs to the Special Issue Feature Papers in Big Data)
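A generic scikit-learn sketch of this kind of four-algorithm accuracy comparison; the `texts` and `labels` inputs are placeholders, and the paper's LAKSA preprocessing and hyperparameter choices are not reproduced:

```python
# Generic sketch of comparing the four classifiers' accuracy on complaint texts
# with scikit-learn. Data loading, Indonesian-language preprocessing, and the
# paper's tuning are omitted; `texts` and `labels` are placeholders.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def compare_classifiers(texts, labels):
    models = {
        "kNN": KNeighborsClassifier(),
        "Random Forest": RandomForestClassifier(),
        "SVM (linear)": SVC(kernel="linear"),
        "AdaBoost": AdaBoostClassifier(),  # default base learner: decision stumps
    }
    for name, clf in models.items():
        pipe = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipe, texts, labels, cv=5, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```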

25 pages, 2500 KiB  
Article
Federated Secure Computing
by Hendrik Ballhausen and Ludwig Christian Hinske
Informatics 2023, 10(4), 83; https://doi.org/10.3390/informatics10040083 - 31 Oct 2023
Cited by 2 | Viewed by 1828
Abstract
Privacy-preserving computation (PPC) enables encrypted computation of private data. While advantageous in theory, the complex technology has steep barriers to entry in practice. Here, we derive design goals and principles for a middleware that encapsulates the demanding cryptography server side and provides a simple-to-use interface to client-side application developers. The resulting architecture, “Federated Secure Computing”, offloads computing-intensive tasks to the server and separates concerns of cryptography and business logic. It provides microservices through an Open API 3.0 definition and hosts multiple protocols through self-discovered plugins. It requires only minimal DevSecOps capabilities and is straightforward and secure. Finally, it is small enough to work in the internet of things (IoT) and in propaedeutic settings on consumer hardware. We provide benchmarks for calculations with a secure multiparty computation (SMPC) protocol, both for vertically and horizontally partitioned data. Runtimes are in the range of seconds on both dedicated workstations and IoT devices such as Raspberry Pi or smartphones. A reference implementation is available as free and open source software under the MIT license. Full article
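The SMPC idea that the middleware encapsulates can be illustrated with a toy additive-secret-sharing sum; this is not the Federated Secure Computing API, only the underlying primitive:

```python
# Toy additive secret sharing over a prime field: each party splits its private
# value into random shares, the parties sum the shares they hold, and only the
# total is revealed. Illustrates the SMPC idea the middleware encapsulates; it
# is not the Federated Secure Computing API.
import secrets

PRIME = 2**61 - 1  # Mersenne prime used as the field modulus

def share(value: int, n_parties: int) -> list[int]:
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(private_values: list[int]) -> int:
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]        # party i shares its value
    partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]  # each party sums one share per input
    return sum(partial_sums) % PRIME

print(secure_sum([12, 30, 8]))  # -> 50, without any party revealing its input
```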

16 pages, 335 KiB  
Review
AI Chatbots in Digital Mental Health
by Luke Balcombe
Informatics 2023, 10(4), 82; https://doi.org/10.3390/informatics10040082 - 27 Oct 2023
Cited by 10 | Viewed by 13114
Abstract
Artificial intelligence (AI) chatbots have gained prominence since 2022. Powered by big data, natural language processing (NLP) and machine learning (ML) algorithms, they offer the potential to expand capabilities, improve productivity and provide guidance and support in various domains. Human–Artificial Intelligence (HAI) is proposed to help with the integration of human values, empathy and ethical considerations into AI in order to address the limitations of AI chatbots and enhance their effectiveness. Mental health is a critical global concern, with a substantial impact on individuals, communities and economies. Digital mental health solutions, leveraging AI and ML, have emerged to address the challenges of access, stigma and cost in mental health care. Despite their potential, ethical and legal implications surrounding these technologies remain uncertain. This narrative literature review explores the potential of AI chatbots to revolutionize digital mental health while emphasizing the need for ethical, responsible and trustworthy AI algorithms. The review is guided by three key research questions: the impact of AI chatbots on technology integration, the balance between benefits and harms, and the mitigation of bias and prejudice in AI applications. Methodologically, the review involves extensive database and search engine searches, utilizing keywords related to AI chatbots and digital mental health. Peer-reviewed journal articles and media sources were purposively selected to address the research questions, resulting in a comprehensive analysis of the current state of knowledge on this evolving topic. In conclusion, AI chatbots hold promise in transforming digital mental health but must navigate complex ethical and practical challenges. The integration of HAI principles, responsible regulation and scoping reviews are crucial to maximizing their benefits while minimizing potential risks. Collaborative approaches and modern educational solutions may enhance responsible use and mitigate biases in AI applications, ensuring a more inclusive and effective digital mental health landscape. Full article
(This article belongs to the Topic AI Chatbots: Threat or Opportunity?)
21 pages, 1546 KiB  
Article
Artificial Intelligence: A Blessing or a Threat for Language Service Providers in Portugal
by Célia Tavares, Luciana Oliveira, Pedro Duarte and Manuel Moreira da Silva
Informatics 2023, 10(4), 81; https://doi.org/10.3390/informatics10040081 - 23 Oct 2023
Cited by 1 | Viewed by 4062
Abstract
According to a recent study by OpenAI, Open Research, and the University of Pennsylvania, large language models (LLMs) based on artificial intelligence (AI), such as generative pretrained transformers (GPTs), may have potential implications for the job market, specifically regarding occupations that demand writing or programming skills. This research points out that interpreters and translators are among the occupations with the greatest exposure to AI in the US job market (76.5%), a trend that is expected to affect other regions of the globe. This article, following a mixed-methods survey-based research approach, provides insights into the awareness and knowledge about AI among Portuguese language service providers (LSPs), specifically regarding neural machine translation (NMT) and large language models (LLMs), their actual use and usefulness, as well as their potential influence on work performance and the labour market. The results show that most professionals are unable to identify whether AI and/or automation technologies support the tools that are most used in the profession. The perceived usefulness of AI is essentially low to moderate, and the professionals who are less familiar with and less knowledgeable about it also demonstrate a lack of trust in it. Two-thirds of the sample expect negative or very negative effects of AI on their profession, citing the devaluation and replacement of experts, the reduction of income, and the reconfiguration of the translator's career into mere post-editing as major concerns. Full article
(This article belongs to the Collection Uncertainty in Digital Humanities)

14 pages, 1974 KiB  
Article
A Method for Analyzing Navigation Flows of Health Website Users Seeking Complex Health Information with Google Analytics
by Patrick Cheong-Iao Pang, Megan Munsie and Shanton Chang
Informatics 2023, 10(4), 80; https://doi.org/10.3390/informatics10040080 - 20 Oct 2023
Cited by 1 | Viewed by 1823
Abstract
People are increasingly seeking complex health information online. However, how they access this information and how influential it is on their health choices remains poorly understood. Google Analytics (GA) is a widely used web analytics tool and it has been used in academic research to study health information-seeking behaviors. Nevertheless, it is rarely used to study the navigation flows of health websites. To demonstrate the usefulness of GA data, we adopted both top-down and bottom-up approaches to study how web visitors navigate within a website delivering complex health information about stem cell research using GA’s device, traffic and path data. Custom Treemap and Sankey visualizations were used to illustrate the navigation flows extracted from these data in a more understandable manner. Our methodology reveals that different device and traffic types expose dissimilar search approaches. Through the visualizations, popular web pages and content categories frequently browsed together can be identified. Information on a website that is often overlooked but needed by many users can also be discovered. Our proposed method can identify content requiring improvements, enhance usability and guide a design for better addressing the needs of different audiences. This paper has implications for how web designers can use GA to help them determine users’ priorities and behaviors when navigating complex information. It highlights that even where there is complex health information, users may still want more direct and easy-to-understand navigations to retrieve such information. Full article
(This article belongs to the Section Health Informatics)
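A minimal sketch of turning page-to-page transition counts, as one might export from GA path reports, into a Sankey diagram with Plotly; the `transitions` data below are hypothetical, not the paper's dataset:

```python
# Minimal sketch: render page-to-page navigation flows as a Sankey diagram with
# Plotly. The `transitions` dict is a hypothetical stand-in for counts exported
# from Google Analytics path data, not the paper's dataset.
import plotly.graph_objects as go

transitions = {            # (source page, destination page) -> visit count
    ("Home", "FAQ"): 120,
    ("Home", "Treatments"): 80,
    ("FAQ", "Treatments"): 45,
    ("Treatments", "Contact"): 30,
}

pages = sorted({p for pair in transitions for p in pair})
index = {page: i for i, page in enumerate(pages)}

fig = go.Figure(go.Sankey(
    node=dict(label=pages),
    link=dict(
        source=[index[s] for s, _ in transitions],
        target=[index[t] for _, t in transitions],
        value=list(transitions.values()),
    ),
))
fig.write_html("navigation_flows.html")
```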

13 pages, 3018 KiB  
Article
Remote Moderated Usability Testing of a Mobile Phone App for Remote Monitoring of Pregnant Women at High Risk of Preeclampsia in Karachi, Pakistan
by Anam Shahil-Feroz, Haleema Yasmin, Sarah Saleem, Zulfiqar Bhutta and Emily Seto
Informatics 2023, 10(4), 79; https://doi.org/10.3390/informatics10040079 - 17 Oct 2023
Cited by 1 | Viewed by 1904
Abstract
This study assessed the usability of the smartphone app, named “Raabta” from the perspective of pregnant women at high risk of preeclampsia to improve the Raabta app for future implementation. Think-aloud and task-completion techniques were used with a purposive sample of 14 pregnant women at high risk of preeclampsia. The sessions were audio-recorded and later professionally transcribed for thematic analysis. The study generated learnings associated with four themes: improving the clarity of instructions, messaging, and terminology; accessibility for non-tech savvy and illiterate Urdu users; enhancing visuals and icons for user engagement; and simplifying navigation and functionality. Overall, user feedback emphasized the importance of enhancing the clarity of instructions, messaging, and terminology within the Raabta app. Voice messages and visuals were valued by users, particularly among the non-tech savvy and illiterate Urdu users, as they enhance accessibility and enable independent monitoring. Suggestions were made to enhance user engagement through visual improvements such as enhanced graphics and culturally aligned color schemes. Lastly, users highlighted the need for improved navigation both between screens and within screens to enhance the overall user experience. The Raabta app prototype will be modified based on the feedback of the users to address the unique needs of diverse groups. Full article

16 pages, 872 KiB  
Article
Qualitative Research Methods for Large Language Models: Conducting Semi-Structured Interviews with ChatGPT and BARD on Computer Science Education
by Andreas Dengel, Rupert Gehrlein, David Fernes, Sebastian Görlich, Jonas Maurer, Hai Hoang Pham, Gabriel Großmann and Niklas Dietrich genannt Eisermann
Informatics 2023, 10(4), 78; https://doi.org/10.3390/informatics10040078 - 12 Oct 2023
Cited by 9 | Viewed by 6005
Abstract
In the current era of artificial intelligence, large language models such as ChatGPT and BARD are being increasingly used for various applications, such as language translation, text generation, and human-like conversation. The fact that these models consist of large amounts of data, including many different opinions and perspectives, could introduce the possibility of a new qualitative research approach: Due to the probabilistic character of their answers, “interviewing” these large language models could give insights into public opinions in a way that otherwise only interviews with large groups of subjects could deliver. However, it is not yet clear if qualitative content analysis research methods can be applied to interviews with these models. Evaluating the applicability of qualitative research methods to interviews with large language models could foster our understanding of their abilities and limitations. In this paper, we examine the applicability of qualitative content analysis research methods to interviews with ChatGPT in English, ChatGPT in German, and BARD in English on the relevance of computer science in K-12 education, which was used as an exemplary topic. We found that the answers produced by these models strongly depended on the provided context, and the same model could produce heavily differing results for the same questions. From these results and the insights throughout the process, we formulated guidelines for conducting and analyzing interviews with large language models. Our findings suggest that qualitative content analysis research methods can indeed be applied to interviews with large language models, but with careful consideration of contextual factors that may affect the responses produced by these models. The guidelines we provide can aid researchers and practitioners in conducting more nuanced and insightful interviews with large language models. From an overall view of our results, we generally do not recommend using interviews with large language models for research purposes, due to their highly unpredictable results. However, we suggest using these models as exploration tools for gaining different perspectives on research topics and for testing interview guidelines before conducting real-world interviews. Full article
(This article belongs to the Topic AI Chatbots: Threat or Opportunity?)

14 pages, 3824 KiB  
Article
A Machine Learning-Based Multiple Imputation Method for the Health and Aging Brain Study–Health Disparities
by Fan Zhang, Melissa Petersen, Leigh Johnson, James Hall, Raymond F. Palmer, Sid E. O’Bryant and on behalf of the Health and Aging Brain Study (HABS–HD) Study Team
Informatics 2023, 10(4), 77; https://doi.org/10.3390/informatics10040077 - 11 Oct 2023
Viewed by 2222
Abstract
The Health and Aging Brain Study–Health Disparities (HABS–HD) project seeks to understand the biological, social, and environmental factors that impact brain aging among diverse communities. A common issue for HABS–HD is missing data. It is impossible to achieve accurate machine learning (ML) if data contain missing values. Therefore, developing a new imputation methodology has become an urgent task for HABS–HD. The three missing data assumptions, (1) missing completely at random (MCAR), (2) missing at random (MAR), and (3) missing not at random (MNAR), necessitate distinct imputation approaches for each mechanism of missingness. Several popular imputation methods, including listwise deletion, min, mean, predictive mean matching (PMM), classification and regression trees (CART), and missForest, may result in biased outcomes and reduced statistical power when applied to downstream analyses such as testing hypotheses related to clinical variables or utilizing machine learning to predict AD or MCI. Moreover, these commonly used imputation techniques can produce unreliable estimates of missing values if they do not account for the missingness mechanisms or if there is an inconsistency between the imputation method and the missing data mechanism in HABS–HD. Therefore, we proposed a three-step workflow to handle missing data in HABS–HD: (1) missing data evaluation, (2) imputation, and (3) imputation evaluation. First, we explored the missingness in HABS–HD. Then, we developed a machine learning-based multiple imputation method (MLMI) for imputing missing values. We built four ML-based imputation models (support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and lasso and elastic-net regularized generalized linear model (GLMNET)) and adapted the four ML-based models to multiple imputations using the simple averaging method. Lastly, we evaluated and compared MLMI with other common methods. Our results showed that the three-step workflow worked well for handling missing values in HABS–HD and the ML-based multiple imputation method outperformed other common methods in terms of prediction performance and change in distribution and correlation. The choice of missing handling methodology has a significant impact on the accompanying statistical analyses of HABS–HD. The conceptual three-step workflow and the ML-based multiple imputation method perform well for our Alzheimer’s disease models. They can also be applied to other disease data analyses. Full article
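A simplified scikit-learn sketch of the averaging idea behind multiple imputation with ML-based imputers; the paper's exact four-model ensemble (SVM, RF, XGB, GLMNET) and its evaluation workflow are not reproduced:

```python
# Simplified sketch of multiple imputation by averaging: impute the same
# incomplete matrix with several ML-based imputers and average the imputed
# values. The paper's four-model MLMI ensemble (SVM, RF, XGB, GLMNET) and its
# evaluation steps are not reproduced here.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import BayesianRidge

def ml_multiple_impute(X: np.ndarray, random_state: int = 0) -> np.ndarray:
    """X: numeric array with np.nan marking missing entries."""
    estimators = [
        RandomForestRegressor(n_estimators=100, random_state=random_state),
        ExtraTreesRegressor(n_estimators=100, random_state=random_state),
        BayesianRidge(),
    ]
    imputations = [
        IterativeImputer(estimator=est, random_state=random_state).fit_transform(X)
        for est in estimators
    ]
    return np.mean(imputations, axis=0)  # simple averaging across imputations
```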

21 pages, 1731 KiB  
Article
Analyzing Indo-European Language Similarities Using Document Vectors
by Samuel R. Schrader and Eren Gultepe
Informatics 2023, 10(4), 76; https://doi.org/10.3390/informatics10040076 - 26 Sep 2023
Viewed by 2133
Abstract
The evaluation of similarities between natural languages often relies on prior knowledge of the languages being studied. We describe three methods for building phylogenetic trees and clustering languages without the use of language-specific information. The input to our methods is a set of document vectors trained on a corpus of parallel translations of the Bible into 22 Indo-European languages, representing 4 language families: Indo-Iranian, Slavic, Germanic, and Romance. This text corpus consists of a set of 532,092 Bible verses, with 24,186 identical verses translated into each language. The methods are (A) hierarchical clustering using distance between language vector centroids, (B) hierarchical clustering using a network-derived distance measure, and (C) Deep Embedded Clustering (DEC) of language vectors. We evaluate our methods using a ground-truth tree and language families derived from said tree. All three achieve clustering F-scores above 0.9 on the Indo-Iranian and Slavic families; most confusion is between the Germanic and Romance families. The mean F-scores across all families are 0.864 (centroid clustering), 0.953 (network partitioning), and 0.763 (DEC). This shows that document vectors can be used to capture and compare linguistic features of multilingual texts, and thus could help extend language similarity and other translation studies research. Full article
(This article belongs to the Special Issue Digital Humanities and Visualization)
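A minimal sketch of method (A), hierarchical clustering of per-language centroid vectors by cosine distance; the input shapes and language labels are assumed for illustration:

```python
# Minimal sketch of method (A): hierarchical clustering of per-language centroid
# vectors using cosine distance. `doc_vectors` maps each language to an array of
# verse-level document vectors (e.g., doc2vec-style embeddings); shapes assumed.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_languages(doc_vectors: dict[str, np.ndarray], n_clusters: int = 4):
    languages = sorted(doc_vectors)
    centroids = np.vstack([doc_vectors[lang].mean(axis=0) for lang in languages])
    distances = pdist(centroids, metric="cosine")   # pairwise cosine distances
    tree = linkage(distances, method="average")     # build the dendrogram
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    return dict(zip(languages, labels))
```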
