5.1. Patient Selection and Management
It is increasingly recognized that patient care requires a personalized approach. This is particularly true for patients on PD, as the modality relies on the patient’s native peritoneal membrane for clearance and ultrafiltration. While PD offers many benefits, patients may develop peritonitis, which necessitates the early administration of appropriate antibiotics. The use of ML can assist in mitigating these challenges.
A fast peritoneal transport rate, that is, a high rate at which water and solutes move across the peritoneal membrane, is associated with suboptimal outcomes in patients using PD, including higher mortality, higher technique failure, peritonitis, and protein-energy malnutrition [4,12]. Discussing ways to optimize treatment, or even an early transition to hemodialysis, may lead to better outcomes for patients with a fast transport rate [12,13]. These conversations can be difficult, particularly for patients who prefer PD for lifestyle reasons. It would, therefore, be helpful to approximate a patient’s transport rate before dialysis therapy begins, but the transport rate is difficult to predict prior to initiating PD [12,14]. Chen et al. used an artificial neural network (ANN), a deep learning approach, to stratify patients’ transporter status using pre-dialysis information. The algorithm used pre-dialysis demographics, comorbidities, and blood and urine chemistries to predict transport status, and the model output was compared against the ground-truth transport status from peritoneal equilibration test results within 1 month of dialysis initiation. The model performed well, with an AUC of 0.812 ± 0.041 [14]. While this study is promising and could potentially help identify patients’ transport rate and inform initial prescription formulation, the small sample size (n = 111) and lack of external validation are key limitations. Stratification tools used to predict outcomes and subsequently suggest treatment modalities have vast potential but must only be used with shared decision making and in the context of appropriate patient factors. For example, PD may be preferable for patients in rural areas with limited access to dialysis units, where monthly visits are preferred over thrice-weekly in-center hemodialysis treatments, regardless of membrane transport status.
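As an illustration of the general approach (not the authors’ actual pipeline), the following minimal Python sketch trains a small feed-forward ANN on synthetic stand-ins for pre-dialysis features and evaluates discrimination by AUC; the feature count, architecture, and data are all illustrative assumptions.

```python
# Minimal sketch of an ANN classifier for pre-dialysis transport-status
# stratification, in the spirit of Chen et al. [14]. Synthetic data stand
# in for pre-dialysis demographics, comorbidities, and chemistries.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

# Stand-in for patients with 20 clinical features and a binary label
# (fast vs. slower transporter); sizes are hypothetical.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Scale inputs, then fit a small feed-forward network (one hidden layer).
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0))
model.fit(X_train, y_train)

# Evaluate discrimination as in the study: area under the ROC curve.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Test AUC: {auc:.3f}")
```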
Peritonitis is a potential complication for people using PD that can contribute to patient dropout. Upon presentation, appropriate antibiotic treatment can be delayed by the time required to process samples and identify the culprit pathogens, so providers typically use empiric antibiotics until culture results from fluid samples return. Early pathogen identification could expedite targeted antibiotic treatment and, therefore, improve patient outcomes; delayed or inappropriate treatment can lead to chronic or recurrent infections, and the use of broad-spectrum antibiotics for empiric therapy is a major contributor to multidrug resistance. Recognizing that the immune system responds to different pathogens in different ways, Zhang et al. aimed to identify and differentiate unique immune fingerprints that correlated with infection by Gram-negative and Gram-positive bacteria in patients on PD. To select the most important features for training their models, they used recursive feature elimination to identify biomarkers associated with Gram-negative infections. The feature elimination approach was tested with three ML methods: a support vector machine, which seeks a separating hyperplane that maximizes the margin between groups; an ANN; and a random forest, an ensemble analysis that aggregates the outputs of multiple decision trees to reach a conclusion [15]. The random forest model performed best and was used to identify five biomarkers that predicted pathogen species. Within Gram-positive bacteria, a random forest model could also identify biomarkers that differentiated immune responses to some Streptococcus and Staphylococcus subspecies. Zhang et al. also identified combinations of biomarkers that could be used to identify clinically meaningful subgroups; for example, matrix metalloproteinase 8, soluble IL-6 receptor, calprotectin, the CD4:CD8 ratio, and transforming growth factor-β were associated with technique failure.
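A minimal sketch of recursive feature elimination paired with a random forest, the combination Zhang et al. found performed best; the synthetic “biomarkers” and all parameters here are illustrative assumptions, not the study’s data.

```python
# Illustrative recursive feature elimination (RFE) with a random forest,
# echoing the biomarker selection in Zhang et al. [15].
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Stand-in for immune-fingerprint biomarkers measured in PD effluent.
X, y = make_classification(n_samples=100, n_features=30, n_informative=5,
                           random_state=0)

# RFE repeatedly fits the model and drops the least important features
# until only the requested number (here, five biomarkers) remains.
selector = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
               n_features_to_select=5)
selector.fit(X, y)
print("Selected feature indices:",
      [i for i, keep in enumerate(selector.support_) if keep])
```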
Early pathogen and outcome prediction using ML shows great promise for expediting and improving antibiotic treatment in patients treated with PD. However, further studies are necessary before broad implementation is appropriate. Each algorithm had strong predictive performance, with an AUC above 0.9, but a unique combination of biomarkers was required for each task to achieve the best predictive capacity. Models able to differentiate pathogens and outcomes using a consistent, limited set of biomarkers would reduce computing requirements and, therefore, be the most cost-efficient, allowing adoption across different healthcare settings. Zhang et al. trained their algorithm on a relatively small dataset of Welsh patients on PD: 83 patients with acute infection and 17 without infection. The results will need to be replicated in larger and more diverse cohorts to demonstrate generalizability. Immunogenicity can change with pathogen mutation and is further affected by variables such as patient-specific characteristics and geographic location. Future algorithm construction will either require a specific focus on individual locations and pathogens or extensive training with large datasets of geographically diverse patients. For this task, the model could be improved by including basic demographic and electronic health record data. Analysis in other patient populations has shown that while ML applied to unimodal data is impressive, it can be further leveraged by including multimodal data, such as patient demographics, longitudinal biomarker data, imaging, and notes from the electronic health record [16,17]. Researchers have utilized models’ ability to interpret multimodal data to predict the progression of kidney disease with high accuracy [18,19].
5.2. Predicting Technique Failure
Though PD is cost-effective and preferred by some patients, high rates of technique failure can discourage adoption of the approach. Tangri et al. assessed the feasibility of using an ANN to predict technique failure, defined as a change to hemodialysis for at least one month. In 2008, they used data from the United Kingdom Renal Registry and found that an ANN could predict technique failure with moderate accuracy (AUC 0.760) and with greater precision than logistic regression [20]. The approach was limited by sample selection bias, as all participants were currently treated with PD, and by incomplete data collection: the etiology of technique failure was not documented, and two-thirds of the included participants lacked comorbidity data. In 2011, the group again applied an ANN to the United Kingdom Renal Registry and compared it to logistic regression, this time with more complete data [21]. They found that the model could predict technique failure with an AUC of 0.765 and that the inclusion of PD center and time on dialysis had the largest impact on model fit. The regression model performed almost as well, with an AUC of 0.762, but the variables with the largest impact on model fit differed: observation time, age, race, weight, laboratory data, and dialysis center [21]. These studies are still limited by lack of external validation, sample size, and data completeness, but they provide evidence that an ANN approach could be useful in predicting technique failure. Larger datasets and advanced deep learning approaches capable of leveraging longitudinal and multimodal data have become available since these studies were published, which could improve prediction accuracy and consistency in future iterations.
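The registry analyses asked which inputs most affect model fit. The original work used stepwise variable inclusion; the sketch below shows a related, commonly used alternative, permutation importance, on synthetic data, purely to illustrate the idea.

```python
# Sketch of assessing which inputs most affect model fit, echoing the
# variable-importance analysis in Tangri et al. [21]. Permutation
# importance is one common approach (not the study's exact method);
# the data here are synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                    random_state=1).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in AUC: large drops
# flag the variables the model relies on most.
result = permutation_importance(net, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=1)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: mean AUC drop = {result.importances_mean[i]:.3f}")
```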
5.3. Predicting Outcomes of PD
ML is well suited to prediction tasks, as machines can hold immense quantities of data and identify hidden patterns that might not be discernible through hypothesis-driven analysis by human researchers. The prediction of patient outcomes could help quantify expectations for patients and allow providers to make decisions about the treatment course. Much research has focused on developing ML models for clinical outcomes, such as mortality and hospital length of stay.
Patients on PD are frequently admitted to the hospital [22]. Predicting anticipated hospital length of stay can identify patients at risk for prolonged hospitalizations and promote the allocation of resources to support those individuals. One study by Kong et al. used a stacking model, a model comprising multiple ML approaches, to identify patients at risk of a prolonged hospital stay, defined as more than 16 days [23]. The input data included hospitalization admissions data from 23,992 patients treated with PD in the Hospital Quality Monitoring System (HQMS), a mandatory national database in China, where the study took place. Prediction accuracy was compared with a traditional logistic regression model, with the output of each base model comprising the stacking model, and with the actual length of stay from the database. Each approach yielded varying degrees of specificity and sensitivity, and the stacking model outperformed the baseline logistic regression in accuracy (0.695) and specificity (0.701). The stacking model and the random forest base model were equally well calibrated (Brier score 0.174 for both, with a lower score indicating better performance) and discriminated equally well (AUROC 0.757 for the stacking model and 0.756 for the random forest model) [23]. In a follow-up study, the group used ML to identify the ten variables most predictive of a prolonged hospital stay and developed a scoring tool to predict prolonged length of stay in patients on PD [24]. The random forest approach had the best metrics (Brier score 0.158 and AUROC 0.756) and was, therefore, used to identify the ten predictive variables [24]. A logistic regression using these variables performed well (AUROC 0.721) and was used to construct a scoring tool stratifying patients into three groups: low, medium, and high risk for prolonged length of stay [24].
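A minimal sketch of a stacking classifier evaluated with the same two metrics reported by Kong et al., the Brier score (calibration) and AUROC (discrimination); the base learners, meta-learner, and synthetic data are assumptions rather than the study’s configuration.

```python
# Illustrative stacking classifier, in the spirit of Kong et al. [23]:
# base models feed predictions into a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score

X, y = make_classification(n_samples=3000, n_features=25, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=2)),
                ("ann", MLPClassifier(hidden_layer_sizes=(32,),
                                      max_iter=1000, random_state=2))],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_tr, y_tr)

prob = stack.predict_proba(X_te)[:, 1]
print(f"Brier score: {brier_score_loss(y_te, prob):.3f}")  # lower is better
print(f"AUROC:       {roc_auc_score(y_te, prob):.3f}")
```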
While the authors used large datasets with extensive variables, they noted that some potentially significant variables were not accessible at the time of analysis and could have had more predictive power than those selected. The study is also limited by a lack of external validation. All included patients resided in China, and some of the variables in the models are not generalizable; for example, in the Wu study, three of the ten predictive variables were patients’ places of residence within regions of China. While patients’ residence and housing status are important proxies for social determinants of health, this information is not collected uniformly across countries. This highlights the need for standardized data collection to allow the development of generalizable models.
Despite overall improvements in the care of patients on PD, mortality remains high at 150.4 per 1000 person-years [22]. Substantial research efforts have been dedicated to improving the prediction of mortality in patients on PD. Zhou et al. used data from one hospital in China to develop deep learning models to predict mortality [25]. They compared the performance of a logistic regression model with two deep learning models: a classic ANN and a mixed model. The classic ANN processed numerical and categorical variables in a single network, while the mixed model built two separate neural networks, one for numerical variables and one for categorical variables, and combined them. Among 859 patients included in the study, 77 met the primary endpoint within the study timeframe. In the test dataset, the logistic regression model had a higher AUROC than either ANN, demonstrating that a more complex architecture is not always superior to standard statistical approaches. However, the mixed ANN model performed better in the follow-up datasets. This study has several notable limitations: the dataset is imbalanced, as the primary outcome of mortality occurred infrequently; the sample size is small; and it lacks geographic and operational diversity, as it took place at a single center.
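To make the “mixed model” idea concrete, here is a hypothetical Keras sketch with separate subnetworks for numerical and categorical inputs that are concatenated before the output layer; the dimensions, layer sizes, and embedding scheme are illustrative assumptions, not the architecture used by Zhou et al.

```python
# Hypothetical "mixed" architecture: one branch for numerical variables,
# one for embedded categorical variables, merged before the output.
from tensorflow.keras import layers, Model

n_numeric = 12     # e.g., labs and vitals (hypothetical count)
n_categories = 10  # e.g., comorbidity codes (hypothetical count)

num_in = layers.Input(shape=(n_numeric,), name="numeric")
cat_in = layers.Input(shape=(1,), name="categorical", dtype="int32")

# Numerical branch: a small dense subnetwork.
num_branch = layers.Dense(16, activation="relu")(num_in)

# Categorical branch: embed the code, then flatten.
cat_branch = layers.Embedding(input_dim=n_categories, output_dim=4)(cat_in)
cat_branch = layers.Flatten()(cat_branch)

# Combine both branches and predict mortality probability.
merged = layers.concatenate([num_branch, cat_branch])
hidden = layers.Dense(8, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(hidden)

model = Model(inputs=[num_in, cat_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.summary()
```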
In another study, Noh et al. used data from 1730 patients treated with PD in a nationwide Korean ESKD database to evaluate several ML models’ ability to predict time to death and 5-year risk of death [26]. They compared the performance of Cox regression, a standard statistical model, with tree-based methods, including survival trees, and with more complex deep learning models. For the analysis without imputation of missing variables, the survival tree method outperformed the traditional Cox model, with an AUC of 0.769 vs. 0.746, respectively. For the 5-year prediction models, logistic regression performed well, with an AUC of 0.804. Incorporating a Long Short-Term Memory network, which allowed the algorithm to use an individual patient’s longitudinal data by “remembering” past data points and iteratively identifying patterns as each additional data point was included, improved the AUC to 0.840. Adding an auto-encoder, which allowed the algorithm to infer missing data points from the patterns it had learned, improved the AUC further to 0.858 [26]. Only the AUC was reported; other informative metrics, such as calibration, would be useful for externally assessing performance. The study is again limited by its lack of external validation and imbalanced dataset, but it demonstrates that the integration of methods may be the most effective approach to predicting patient outcomes.
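A minimal Keras sketch of the Long Short-Term Memory idea: the network consumes a patient’s sequence of visits and outputs a mortality probability. The shapes, random data, and training settings are assumptions, and the study’s auto-encoder for missing-data imputation is omitted here.

```python
# Illustrative LSTM over longitudinal patient data, echoing Noh et al.
# [26]: the recurrent layer "remembers" earlier visits when predicting.
import numpy as np
from tensorflow.keras import layers, Sequential

n_patients, n_visits, n_features = 200, 8, 10  # hypothetical shapes
X = np.random.rand(n_patients, n_visits, n_features).astype("float32")
y = np.random.randint(0, 2, size=n_patients)   # random stand-in labels

model = Sequential([
    layers.Input(shape=(n_visits, n_features)),
    layers.LSTM(32),                        # summarizes the visit sequence
    layers.Dense(1, activation="sigmoid"),  # 5-year mortality probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.fit(X, y, epochs=2, verbose=0)
```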
While several ML models for clinical outcomes in patients on PD have been developed, none, to our knowledge, has yet seen widespread use or acceptance. Training and deploying ML models across hundreds of thousands of medical records requires substantial computing power, which is currently infeasible in many settings. Current models are developed in small samples and have not been externally validated; it is, therefore, unclear whether they generalize to different or more diverse patient samples. Small sample sizes also predispose models to future error. Deep learning approaches, including neural networks, are susceptible to overfitting, in which the algorithm becomes overly specific to the training data and cannot perform on new, unseen data. Overfitting can result from a variety of common implementation pitfalls: small sample size, disproportionate noise, inclusion of irrelevant data, and continuous iteration or over-training on a single dataset. In these cases, the algorithm strains to find a pattern that explains every training data point and subsequently misrepresents the pattern as it exists beyond the training data. Deep learning approaches, like Long Short-Term Memory, are often considered a “black box”, as it is difficult to understand the components that contributed to a conclusion. In the context of decisions about patient care, provider comprehension and the capacity for validation are essential. Further work and provider education are needed before models can be implemented.
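As a concrete guard against the overfitting pitfalls described above, the sketch below holds out a test set and uses early stopping, halting training when the score on an internal validation split stops improving; the deliberately small, noisy synthetic dataset is an assumption chosen to make the train–test gap visible.

```python
# Two routine guards against overfitting: a held-out test set and
# early stopping on an internal validation split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Small, noisy dataset with few informative features: overfit-prone.
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

# early_stopping=True reserves part of the training data as a validation
# split and halts training when the validation score plateaus.
net = MLPClassifier(hidden_layer_sizes=(64,), early_stopping=True,
                    validation_fraction=0.2, max_iter=2000, random_state=3)
net.fit(X_tr, y_tr)
print("Train AUC:", roc_auc_score(y_tr, net.predict_proba(X_tr)[:, 1]))
print("Test AUC: ", roc_auc_score(y_te, net.predict_proba(X_te)[:, 1]))
```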
5.4. Patient Education for PD
Home kidney replacement therapies, such as PD, require high levels of patient health education, engagement, and self-care capacity. Because the patient is largely independent, questions about treatments or the PD catheter can arise at any time of the day or night, and it can be difficult to provide answers and instructions expeditiously. AI-based educational tools designed for patient use could effectively fill this gap. At this time, open-source AI tools are not yet reliable enough to accurately answer patients’ medical questions, and patients should continue to obtain information directly from their healthcare providers. To close this gap, researchers are evaluating how current AI tools perform and developing accurate AI-based educational tools specifically for patients with kidney disease and those treated with PD, built on trusted information with built-in safety checks.
Cheng et al. developed a LINE application-based AI chatbot to provide patients with educational material, including instructional videos, clinical reminders, home care guidance, and dietary guidance [27]. Overall satisfaction with the chatbot was high, with an average patient satisfaction score of 4.5/5.0. The group tracked clicks and found that patients most frequently used the sections on home PD care and PD dietary guidance. There was no significant difference in infection rates in the three months before versus after implementation, but there was an associated reduction in technique-related peritonitis in the month before versus after implementation (relative risk = 0.8) [27].
The results indicate that the implementation of the chatbot improved patient self-care efficacy and health knowledge. Chatbots and intentionally designed patient education delivery systems are useful for disseminating reliable knowledge quickly and appropriately, and we expect chatbot use to increase as large language models, such as ChatGPT, become more accessible. However, the illusion of communicating with a trained healthcare provider can lead to overreliance on chatbot output, which could discourage patients from raising concerns with their providers [28]. Every patient is different, and providers must balance patient interaction with automated tools to enhance the patient experience. Chatbots must provide accurate information to maintain patient trust. Accessibility of applications and language is important, though it can be difficult to anticipate an individual’s specific needs during application development. Cheng et al. noted that their chatbot could only be accessed on a smartphone or tablet and was not yet accessible to patients with poor eyesight [27]. Not all patients have access to technology or are technologically literate; it will be important to provide detailed instructions or adequate alternatives for patients who cannot access these resources.
Patient privacy and bias mitigation are concerns for all modalities of AI tool implementation, especially chatbots and other patient-facing interactive tools. Part of the strength of large language models is that their developers can continue training them on user inputs after deployment, allowing user feedback to inform future behavior. Some open-weight large language models, unlike proprietary services such as ChatGPT, can be downloaded and run locally, meaning the trained algorithm is used but the input data never leave the device. Clear informed consent when implementing these tools is key, particularly where the large language model stores and learns from input data and where patients might be inclined to enter personal information. Further, natural language processing and large language models are tools built from preexisting data, subject to human error and bias; the patterns these tools recognize can, therefore, replicate systemic issues in the data, including disparate responses to certain language.
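For illustration, the snippet below runs a small open-weight model locally with the Hugging Face transformers library, so prompts are processed on local hardware rather than sent to a third-party service; distilgpt2 is only a placeholder, as a real deployment would use a capable, instruction-tuned open-weight model vetted for medical use.

```python
# Local inference with an open-weight model: weights are downloaded once,
# then all generation happens on local hardware, so patient input is not
# transmitted to an external service.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")  # placeholder model

prompt = "Patient education note: peritoneal dialysis exit-site care includes"
result = generator(prompt, max_new_tokens=40, do_sample=False)
print(result[0]["generated_text"])
```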
Given concerns about the appropriateness of large language model responses for chronic kidney disease (CKD) education, Acharya et al. tested the quality and accuracy of AI responses to standardized questions about nephrology care. The models tested included Bard AI, Bing AI, and two versions of ChatGPT, all of which were trained on large sets of publicly available internet data [29]. Two nephrologists reviewed the AI-generated answers to identify erroneous or misleading content. They found that responses were generally correct but frequently cited incorrect references and sometimes provided misleading information. Any ML model is susceptible to hallucinations: incorrect or nonsensical outputs that do not appear to arise from patterns in the data. Hallucinations can arise for many reasons, including incomplete or inadequate training data, overfitting, and high model complexity. As the authors indicate, a drawback of training large language models on large amounts of public data is that the algorithm can retrieve medical information from sources that are not necessarily peer-reviewed or trustworthy [29]. To maintain trust in and the reliability of ML tools, every effort must be made to ensure that accurate information is disseminated.
Another area of research focuses on the ability of large language models to answer medical knowledge questions. Wu et al. tested the performance of several open-source and proprietary large language models on the Nephrology Self-Assessment Program (nephSAP), a multiple-choice exam that nephrologists can use for self-assessment, to evaluate their ability to answer complex medical questions. In a world in which patients increasingly ask open-source large language models medical questions, it is useful to understand their capabilities so patients can be advised in advance about their accuracy. The group found that the proprietary models GPT-4 and Claude-2 performed better than the open-source models, but no model achieved the human passing score of 75% [30]. GPT-4 scored 73.3%, Claude-2 scored 54.4%, and the open-source models scored between 17.1% and 30.6%. GPT-4 performed worst on electrolyte questions, which often require multi-step quantitative reasoning, likely because large language models tend to have poor “reasoning” abilities [30]. The authors also used BLEU and Word Error Rate (WER), a word-level edit-distance metric based on the Levenshtein distance, to assess output quality. All models had WER scores between 0 and 22%, and all BLEU scores fell below approximately 0.1, indicating suboptimal matching to the reference output text. The performance gap may seem surprising, but the authors hypothesized that it stems from the proprietary models’ access to peer-reviewed third-party medical data during training [30]; a vast quantity of peer-reviewed data is not open source and is, therefore, omitted from the training of open-source models. Access to proprietary large language models is limited by the high cost of obtaining a license. In future medical large language models and informational chatbots, it will be essential to ensure appropriate access to trustworthy data to improve performance. This access will be especially important for patients with complex medical treatment, like those treated with PD. With the advent of proprietary models, there is also potential for discrepant access to the highest-quality tools based on socioeconomic status; ethical deployment of these tools must ensure equal access for everyone.
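For reference, WER is the word-level Levenshtein (edit) distance between a model’s output and a reference text, divided by the reference length. A minimal self-contained implementation, with an invented example:

```python
# Word Error Rate (WER): minimum number of word substitutions, insertions,
# and deletions to turn the hypothesis into the reference, divided by the
# reference length. Computed with dynamic programming.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the
    # first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of four reference words: WER = 0.25.
print(wer("start empiric antibiotics now", "start antibiotics now"))
```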
To address large language model hallucinations and the potential for misleading output, researchers have sought ways to ground the models’ output in data from appropriate sources. One method is retrieval-augmented generation, in which the large language model enhances its response to a prompt using current information retrieved from relevant, trusted sources. Particularly when large language models are trained on historical data, this method helps ensure that the output is up to date and appropriately related to the scenario in question. For patients treated with PD requesting information about their symptoms or medical treatment, updated and accurate information is particularly crucial. Such a tool does not yet exist for patients treated with PD, but Miao et al. implemented a retrieval-augmented generation approach to improve GPT-4’s responses to inquiries about CKD [31]. The team created a CKD knowledge retrieval model that could quickly source information from a curated database built on the KDIGO 2023 CKD guidelines. When prompted, the general GPT-4 tended to provide broad recommendations, while the retrieval-augmented GPT-4 tended to output more specific recommendations that aligned with the KDIGO 2023 CKD guidelines. As the authors outline, implementing a nephrology-specific large language model would require extensive curation of trusted source material that is continually updated as care guidelines evolve. Because large language models are less adept at reasoning tasks, Miao et al. included instructions that the model should state when it is unable to answer a question or when the user should seek additional external medical advice. In an era of reliance on internet resources for care information, it will be important to implement such guardrails to protect patients from false or dangerous information. This type of maintenance and regulation will require extensive external validation, upkeep, and rigorous testing, which will be a barrier to rapid implementation.
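A minimal sketch of the retrieval step behind retrieval-augmented generation: index a curated corpus, find the passage most similar to the question, and prepend it to the prompt so the model answers from the trusted source. The TF-IDF retrieval and placeholder passages are simplifying assumptions; Miao et al.’s system and production tools typically use dense embeddings over actual guideline text.

```python
# Illustrative retrieval step for retrieval-augmented generation: fetch
# the most relevant curated passage and ground the model prompt in it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [  # stand-ins for curated, guideline-derived passages
    "Blood pressure targets for patients with CKD ...",
    "Dietary sodium restriction recommendations ...",
    "Referral criteria for kidney replacement therapy ...",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

question = "What blood pressure should I aim for with CKD?"
scores = cosine_similarity(vectorizer.transform([question]), doc_vectors)
best = scores.argmax()

# The retrieved passage is placed in the prompt so the model grounds its
# answer in the trusted source instead of its training data alone.
prompt = (f"Answer using only this guideline excerpt:\n{corpus[best]}\n\n"
          f"Question: {question}")
print(prompt)
```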