1. Introduction
Jaundice is a common clinical condition characterized by the yellowing of the skin and sclera, which results from elevated bilirubin levels in the blood. This condition may signal various underlying health issues, such as liver disease, hemolysis, or bile duct obstruction [1]. The early detection and accurate diagnosis of jaundice are critical for the effective treatment and management of these underlying conditions [2,3].
Non-invasive screening techniques have several advantages: they reduce discomfort and anxiety for patients, lower the risk of infection, and are more accessible, especially in resource-limited settings [3,4]. In addition, non-invasive techniques enable faster clinical decision-making and early interventions. The ability to perform rapid and frequent screenings can improve patient outcomes and reduce healthcare costs [5]. Bilirubin levels are traditionally measured through blood tests; however, there is no dedicated screening test for jaundice.
Recent advancements in medical technology have underscored the potential of non-invasive diagnostic techniques. Clinically, evaluating jaundice starts with distinguishing unconjugated from conjugated hyperbilirubinemia. Unconjugated forms relate to increased production (e.g., hemolysis), impaired uptake, or defective conjugation (e.g., Gilbert’s syndrome). Conjugated forms are due to hepatocellular injury or cholestasis, either intrahepatic (e.g., hepatitis, drugs, sepsis) or extrahepatic (e.g., bile duct obstruction). This biochemical approach and classification help narrow down causes and further guide workup. The yellow discoloration of the sclera and changes in urine color are hallmark symptoms of jaundice, presenting opportunities for the development of non-invasive diagnostic tools [2,6,7]. Dark urine is often an early sign of conjugated hyperbilirubinemia, reflecting the increased renal excretion of water-soluble bilirubin. It commonly occurs in the setting of hepatocellular injury or cholestasis and may appear before overt jaundice becomes clinically evident. However, dark urine is not specific to liver disease and can also result from hematuria, hemoglobinuria, or myoglobinuria. Therefore, a brief visual assessment of urine color may serve as a simple screening clue for hepatic dysfunction when interpreted in the proper clinical context.
Recent advances in machine learning and artificial intelligence, combined with image processing techniques, hold significant clinical promise. These technologies can analyze visual data with high precision, potentially identifying jaundice symptoms with greater accuracy and speed than traditional methods. In this study, we aimed to develop an artificial intelligence program capable of predicting jaundice based on sclera images and changes in urine color. Our objective was to assess whether this program can serve as an effective screening tool in clinical settings and to compare its accuracy with that of the gold standard, namely blood tests. This paper presents the development process, technical specifications, and validation results of our program, illustrating its potential for jaundice screening based on experimental findings.
2. Materials and Methods
The Transparent Reporting of an artificial intelligence (AI)-powered multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD + AI) checklist is provided in Supplementary Table S1 [8]. All authors had access to the study data and reviewed and approved the final manuscript.
2.1. Ethical Considerations
The study protocol was reviewed and approved by the Institutional Review Board of the Soonchunhyang University College of Medicine (IRB No. SCHBC 2023-12-009-001). The study was registered on cris.nih.go.kr (Identifier: Clinical Research Information Service (CRiS) of Republic of Korea, KCT0009915). Informed consent was waived due to the retrospective design of the study.
2.2. Patients and Acquisition of Pictures of Sclera and Urine
The study included patients who visited the gastroenterology department with abnormal liver function, liver disease, or biliary tract disease and underwent blood and urine tests from October 2022 to October 2023. As a control group, we enrolled patients who visited the same center but had no specific underlying liver or biliary tract disease and were at low risk for jaundice and hyperbilirubinemia.
Scleral photography was performed by placing a mask over the patient’s nose and mouth, exposing only one eye. Each patient had only one scleral photograph taken from the right eye. Consequently, each patient’s scleral image was used only once, either in the training or validation set. An A4 sheet from Hankuk Paper (Ulsan, Republic of Korea) was positioned beside the patient’s face, occupying half of the screen. The inclusion of the A4 paper in the photograph aimed to correct the white balance, thus minimizing errors in scleral color between the different photographs (Figure 1).
Urine color was documented by placing the specimen container against a white background at the reception desk. All urine photographs were taken at this same fixed location under consistent lighting conditions, using the white surface of the background itself as a reference for white balancing.
The photographs were captured using an iPhone 12 Pro (Apple Inc., Cupertino, CA, USA) with the live photo feature disabled and all other settings at their default values. The entire process, including explaining the procedure to the patient and acquiring the photographs, took less than five minutes.
The study investigated various factors, including patients’ age, sex, underlying liver disease, total bilirubin, direct bilirubin, aspartate aminotransferase (AST), alanine transaminase (ALT), and alkaline phosphatase (ALP). The normal reference ranges at our institution were as follows: total bilirubin, 0.2–1.5 mg/dL; direct bilirubin, 0.0–0.2 mg/dL; AST, 5–45 U/L; ALT, 0–40 U/L; and ALP, 30–120 U/L. Blood tests and photography were conducted within a maximum time difference of three days.
A similar study with a comparable experimental design involving 130 patients reported a Pearson correlation coefficient of 0.7 (p-value < 0.001) between bilirubin levels and the measured parameters [9]. Based on this correlation, we estimated the minimum required sample size using Cohen’s effect size calculation for correlation studies. With a significance level (α) of 0.05 and a statistical power of 0.8, the required sample size was determined to be 17 patients. However, to enhance the robustness and generalizability of our findings, we collected a substantially larger sample size than this minimum requirement.
2.3. Primary Outcome and Evaluation
The primary outcome of this study was to predict bilirubin levels and the presence of jaundice using machine learning algorithms. For the prediction of bilirubin levels, we performed regression analysis to obtain the R² value and p-value, assessing the strength and significance of the model’s predictions. For the diagnosis of jaundice, we used a confusion matrix to calculate key performance metrics, including accuracy, precision, recall, and the F1 score, to evaluate the classification capability of the model.
2.4. Definition of Threshold Values
The term “jaundice” is synonymous with hyperbilirubinemia, where normal serum levels of total bilirubin are typically less than 1 mg/dL [10]. Clinically, jaundice manifests as scleral icterus when serum bilirubin levels exceed 3 mg/dL [10]. Based on this clinical observation, we categorized bilirubin levels above 3 mg/dL as abnormal and those at or below this threshold as normal. Initially, we calculated the accuracy of the algorithm using this classification. After identifying the most effective algorithm, we then fine-tuned the bilirubin threshold used to differentiate between normal and abnormal cases. This threshold was adjusted from 1.0 to 4.0 in 0.1 mg/dL increments to establish the optimal level.
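The threshold search described above can be sketched as follows. This is an illustration under a stated assumption: measured and predicted bilirubin values are both binarized at the same cutoff, and accuracy is the fraction of cases on which they agree. The text does not spell out this detail, and the function names are hypothetical.

```python
def accuracy_by_cutoff(measured, predicted, start=1.0, stop=4.0, step=0.1):
    """Map each cutoff (mg/dL) to the accuracy obtained at that cutoff.

    Assumption: a value strictly above the cutoff counts as abnormal,
    for both the measured and the predicted bilirubin.
    """
    n_steps = round((stop - start) / step)
    accuracies = {}
    for i in range(n_steps + 1):
        cutoff = round(start + i * step, 1)  # avoid floating-point drift
        hits = sum((m > cutoff) == (p > cutoff)
                   for m, p in zip(measured, predicted))
        accuracies[cutoff] = hits / len(measured)
    return accuracies


def best_cutoff(accuracies):
    """Return the lowest cutoff that reaches the highest accuracy."""
    return max(accuracies.items(), key=lambda item: item[1])[0]
```

Ties are resolved in favor of the lowest cutoff because `max` returns the first maximal entry and the dictionary is filled in ascending order; a different tie-breaking rule would be equally defensible.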
2.5. Image Analysis
The default settings were used for smartphone photography; however, this may lead to white balance issues, making it difficult to compare colors under the same conditions. To account for minor environmental variations, such as changes in lighting or the patient’s skin color, we used A4 paper to correct the colors. To enhance robustness against varying lighting conditions, we applied the von Kries transformation, a per-channel gain correction, and used it to normalize the color and brightness of all images based on the pixel values of the white reference paper (Figure 2). The images were then converted to the YCbCr color space and analyzed [11].
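The white-reference correction and color-space conversion described above can be sketched as follows. This is a simplified illustration, not the study's implementation: it assumes the von Kries scaling is applied directly to 8-bit RGB channels (the transformation is formally defined on cone responses), that the target white is (255, 255, 255), and that the YCbCr conversion follows the full-range BT.601 (JFIF) convention. All function names are hypothetical.

```python
def von_kries_gains(white_patch, target_white=(255.0, 255.0, 255.0)):
    """Per-channel gains mapping the mean color of the white patch to target_white.

    white_patch: list of (R, G, B) pixels sampled from the reference paper.
    """
    n = len(white_patch)
    means = [sum(px[c] for px in white_patch) / n for c in range(3)]
    return tuple(t / m for t, m in zip(target_white, means))


def apply_gains(pixels, gains):
    """Scale every RGB pixel by the gains and clip to the 8-bit range."""
    return [
        tuple(min(255.0, max(0.0, v * g)) for v, g in zip(px, gains))
        for px in pixels
    ]


def rgb_to_ycbcr(pixel):
    """Full-range BT.601 (JFIF) conversion from RGB to YCbCr."""
    r, g, b = pixel
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return (y, cb, cr)
```

After correction, a pixel taken from the reference paper maps to a neutral white, so its chroma components (Cb, Cr) sit at the midpoint of 128; a yellowish sclera would be expected to shift Cb below that midpoint.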
Since bilirubin levels and the presence of jaundice directly influence the color of a patient’s sclera and urine, we developed a model to predict these factors based on pixel values from these regions. Although the latest deep learning techniques employ depth and complexity in their architecture, which could potentially eliminate the need for image correction, they are significantly limited by their requirement for large datasets. Given the limited dataset available for this study, we concentrated on extracting color data from specific regions of interest through image preprocessing, which is a process analogous to feature selection in traditional machine learning techniques. As a result, we implemented well-known conventional machine learning algorithms, such as Decision Tree, Random Forest, and XGBoost [12,13,14]. Additionally, we utilized the DeepSets model within a neural network framework [15]. Finally, we included the ResNet-18 and ResNet-50 models [16], which are fundamental deep learning algorithms in the convolutional neural network family, to compare overall performance. For the ResNet models, we adopted ImageNet-pretrained weights, which are well-suited for achieving strong performance even with a smaller dataset [17].
The input structure (or input layer dimension) varied across the models. For ResNet-18 and ResNet-50, which are based on the CNN architecture, input images were resized to 224 × 224 × 3. For the DeepSets model, input data consisted of YCbCr values from the region of interest (RoI), which was determined by masking (width × height × 3). For conventional machine learning models, input was composed of a one-dimensional vector of 1567 extracted features. Further details are provided in Supplementary Text S1.
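To illustrate why a set-based model suits a masked RoI of arbitrary size, the following sketch shows the general DeepSets structure: each pixel is encoded independently, the encodings are pooled, and a second function maps the pooled vector to a prediction. The single-layer encoders, the mean pooling, and all weights are illustrative assumptions; the architecture actually used is described in Supplementary Text S1.

```python
def phi(pixel, weights, biases):
    """Per-pixel encoder: one linear layer followed by ReLU."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, pixel)) + b)
        for row, b in zip(weights, biases)
    ]


def deep_sets_predict(pixels, phi_w, phi_b, rho_w, rho_b):
    """Compute rho(mean(phi(x))) over the set of RoI pixels.

    pixels: list of (Y, Cb, Cr) tuples from the masked region, in any order.
    Mean pooling (an assumption here) keeps the output independent of both
    the pixel order and the number of pixels in the RoI.
    """
    encoded = [phi(px, phi_w, phi_b) for px in pixels]
    n = len(encoded)
    pooled = [sum(vec[k] for vec in encoded) / n for k in range(len(phi_b))]
    return sum(w * p for w, p in zip(rho_w, pooled)) + rho_b
```

Because the pooling step is symmetric, shuffling the RoI pixels leaves the prediction unchanged, which is the property that lets the model consume an irregularly shaped mask without resizing it to a fixed grid.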
All experiments were conducted using stratified 5-fold cross-validation. To minimize the impact of randomness and reduce discrepancies between the distributions of the train and validation sets, we divided these sets based on total bilirubin levels. Initially, cases were categorized as normal or abnormal according to their bilirubin levels, and the sets were assembled to maintain a consistent ratio between these two groups. Further information on the fine-tuning of hyperparameters and the training settings for each algorithm is available in Supplementary Text S2.
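The stratified split described above can be sketched in plain Python. This is a minimal illustration with a hypothetical function name and a simple round-robin assignment; the study reports using Scikit-learn, whose own stratified splitter may assign cases differently.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs that keep the class ratio in each fold.

    labels: one class label per case, e.g. 1 for abnormal bilirubin
    and 0 for normal.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        # Deal the cases of this class across the folds in round-robin order.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val
```

Each case appears in exactly one validation fold, so a patient's single image never contributes to both training and validation within the same fold.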
2.6. Statistical Analysis
All statistical analyses were performed using Python (v3.9) and the SciPy and Scikit-learn libraries. Continuous variables were expressed as means ± standard deviations or medians with interquartile ranges (IQRs), depending on the distribution. Categorical variables were presented as frequencies and percentages. To compare the differences between the liver disease group and control group, we used the Wilcoxon rank-sum test for continuous variables, as the data were not normally distributed, and the Chi-square test for categorical variables. A p-value < 0.05 was considered statistically significant.
4. Discussion
Various liver diseases, including impairments in hepatic metabolism or bilirubin transport or injury to any part of the hepatobiliary system, may result in hyperbilirubinemia [18]. In hepatobiliary patients, the early detection of jaundice enables timely intervention and helps prevent potential liver failure. Therefore, numerous studies have been conducted to develop a non-invasive, easy-to-use method for the early detection of jaundice. A notable example of such a method is transcutaneous bilirubinometry, which is commonly used in newborns [19]. This non-invasive technique measures skin reflectance at specific wavelengths of light to assess the extent of skin yellowing, thereby determining the severity of neonatal jaundice. Its simplicity has made it a popular choice in neonatal care units. Recently, the widespread availability of smartphones and advances in image processing have prompted efforts to analyze changes in scleral or skin color due to hyperbilirubinemia in adults, where traditional transcutaneous bilirubinometry is not applicable.
For the utilization of scleral images, the latest deep learning technologies could autonomously identify scleral regions if extensive datasets were available. However, with limited patient data, it becomes necessary to segment the scleral area and analyze the color values from this specific region. Essential steps in this process include white balance or color adjustment, as varying lighting conditions during image capture can significantly affect the color temperature. Park et al. [9] employed a patch made from a sheet of white paper with a rectangular hole, while Mariakakis et al. [20] used paper glasses when taking a facial photo. Our study adopted a simpler method, using only an A4 sheet of white paper for color correction, which facilitated the effective preprocessing of the scleral images. Some studies have developed and utilized equipment such as goggles to control light for research purposes [7,21], whereas others have relied solely on images without incorporating any additional equipment [6,22,23,24].
Jaundice prediction using images involves two primary tasks: predicting the exact bilirubin level through regression and determining the presence or absence of jaundice through classification, evaluated by accuracy. In their BiliScreen research, Mariakakis et al. [20] reported a Pearson correlation coefficient of 0.78 when using a smartphone and goggles. When using a box designed to block out ambient light, the correlation coefficient increased to 0.89. This underscores the importance of a controlled environment for accurate predictions. Our approach captured images in an open setting and achieved a correlation coefficient of 0.79, which is consistent with previous studies. Prajapati et al. [22] achieved a correlation coefficient of 0.96 using their jScan smartphone application, demonstrating a high level of accuracy in its predictions.
For diagnosing jaundice, various detailed analysis methods and equipment have been used, with most approaches that utilize scleral images reporting around 90% accuracy [6,7,20,21,22,24]. In our study, the DeepSets algorithm, which yielded the best results, achieved an accuracy of 0.871. Considering that most studies were conducted with limited datasets of fewer than 100 subjects and used different criteria, it is more important to focus on the potential of scleral image-based analysis rather than on performance differences. Interestingly, after developing an algorithm to predict bilirubin levels, we used the predicted values to determine the optimal threshold for distinguishing between normal and jaundiced cases by varying the threshold from 1.5 to 4.0 in 0.1 mg/dL increments. Using the two algorithms that showed the highest correlation with actual bilirubin levels, DeepSets and Random Forest, the optimal thresholds were identified as 2.6 and 2.9 mg/dL, respectively. Scleral icterus is detectable when total serum bilirubin levels exceed 3 mg/dL [10]. Our study supports this well-established theory and further suggests the potential for the early detection of mild hyperbilirubinemia in the 2 to 3 mg/dL range. This level may not be easily noticeable but is higher than the normal range. This highlights the possible clinical utility of our approach for early diagnosis.
In this study, we not only utilized scleral images but also incorporated urine color analysis. This approach, though not previously attempted, was deemed viable due to the established correlation between jaundice and urine color [25]. Urine images were captured against a white background under consistent lighting conditions, so the background itself served as the white-balance reference and no separate reference sheet was needed. However, the accuracy of predictions based on urine images did not match that of predictions made using only scleral images. Even when combined with scleral images, the results were still less accurate than those obtained from scleral images alone.
The limitations of this study include its reliance on a limited number of scleral images from both patients and healthy individuals, which restricts the analysis and prevents it from reflecting the real-world distribution of cases. Additionally, the clinical utility of this predictive model cannot be confirmed. To implement this model in clinical practice, a more extensive dataset and images captured in diverse environments will be necessary to minimize potential errors and improve its robustness.
Nevertheless, we demonstrate the potential for predicting bilirubin levels using scleral images, emphasizing the utility of a simple tool such as A4 white paper for color correction. Although the results of urine-based predictions were less effective, incorporating urine analysis into this study adds significant value.