Review Reports - Joint Image Processing with Learning-Driven Data Representation and Model Behavior for Non-Intrusive Anemia Diagnosis in Pediatric Patients

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

A new deep learning approach named Recurrent Expansion Network (RexNet) is proposed in the manuscript for the classification of three types of images for the diagnosis of anemia in pediatric patients. RexNet was developed based on the recurrent expansion rule and includes a trained LSTM model as a layer. The developed image processing pipeline extracts 181 features from 13 categories, and the subsequent feature selection algorithm identified the most important features. The developed classifier was tested on three public datasets: conjunctival eye images, palmar images, and fingernail images. The augmented images were removed from the datasets, and a balancing technique was used to increase the number of nonanemic samples. RexNet achieved over 99% across all evaluated classification metrics, demonstrating significant improvements in diagnostic results.

Though the paper is well written and easy to follow, some points should be improved before publication.

It would be good to mention the age of the children in the abstract and introduction.

Age and gender statistics for the datasets should be provided.

The font size in Fig. 9 should be enlarged, and the actual numbers in the confusion matrices should be provided.

Line 210: "A mentioned" should read "As mentioned".

Line 269: As there are only 2 classes, I think it is worth describing the balancing more precisely.

Line 338: Methods should be Section 3.

Author Response

Comment 1: A new deep learning approach named Recurrent Expansion Network (RexNet) is proposed in the manuscript for the classification of three types of images for the diagnosis of anemia in pediatric patients. RexNet was developed based on the recurrent expansion rule and includes a trained LSTM model as a layer. The developed image processing pipeline extracts 181 features from 13 categories, and the subsequent feature selection algorithm identified the most important features. The developed classifier was tested on three public datasets: conjunctival eye images, palmar images, and fingernail images. The augmented images were removed from the datasets, and a balancing technique was used to increase the number of nonanemic samples. RexNet achieved over 99% across all evaluated classification metrics, demonstrating significant improvements in diagnostic results. Though the paper is well written and easy to follow, some points should be improved before publication.

Response 1: Thank you for your feedback. I have carefully addressed all comments point by point, and the changes are highlighted in red in the revised manuscript.

Comment 2: It would be good to mention the age of the children in the abstract and introduction.

Response 2: Thank you for your comment. The age of the children has been added to the abstract and introduction of the revised manuscript.

Comment 3: Age and gender statistics for the datasets should be provided.

Response 3: Thank you for your comment. We have inserted the following table and description to provide this important information.

Table 2 presents an overview of the datasets collected from eye conjunctival images, summarizing key features such as hemoglobin level, age, severity of anemia, gender distribution, and hospital source. Although specific details about the individual datasets are not available, all datasets were compiled by the same research team during closely aligned collection periods. This consistency in methodology and data gathering supports the assumption that findings related to one dataset can be generalized to the others. The mean hemoglobin (Hb) level is 10.35 with a standard deviation of 2.25, and values span a broad range from a minimum of 3.1 to a maximum of 15. The mean age of the subjects is 31.58 months (standard deviation 16.78), with a range of 6 to 60 months. In terms of severity, the distribution is Mild (20.28%), Moderate (32.68%), Non-Anemic (40.28%), and Severe (6.76%). The gender distribution shows a slightly higher proportion of males (56.90%) than females (43.10%). The data come from ten hospitals, the largest contributors being Komfo Anokye Teaching Hospital (18.87%) and Ahmadiyya Muslim Hospital (18.03%), followed by Sunyani Municipal Hospital (14.08%) and Bolgatanga Regional Hospital (13.38%).

Table 2. Overview of datasets collected from eye conjunctival images.

Feature	Mean	Standard deviation	Min	Max	Categories
Hemoglobin level	10.35	2.25	3.1	15	-
Age (months)	31.58	16.78	6	60	-
Severity	-	-	-	-	Mild (20.28%), Moderate (32.68%), Non-Anemic (40.28%), Severe (6.76%)
Gender	-	-	-	-	Female (43.10%), Male (56.90%)
Hospital	-	-	-	-	Ahmadiyya Muslim Hospital (18.03%), Bolgatanga Regional Hospital (13.38%), Ejusu Government Hospital (5.77%), Holy Family Hospital (1.13%), Kintampo Municipal Hospital (8.45%), Komfo Anokye Teaching Hospital (18.87%), Manhyia District Hospital (6.06%), Nkawie-Toase Government Hospital (12.11%), SDA Hospital (2.11%), Sunyani Municipal Hospital (14.08%)
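As an illustration of how the rows of Table 2 can be computed from per-subject records, the following Python sketch derives the mean, standard deviation, minimum, and maximum of a numeric variable and the percentage breakdown of a categorical one. The records below are hypothetical toy values, not the study data, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the report does not state which estimator was used.

```python
import statistics
from collections import Counter

def summarize_numeric(values):
    """Mean, sample standard deviation, min and max (numeric rows of the table)."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # assumption: sample SD (n - 1)
        "min": min(values),
        "max": max(values),
    }

def category_percentages(labels, ndigits=2):
    """Percentage of records per category (the 'Categories' column)."""
    counts = Counter(labels)
    total = len(labels)
    return {k: round(100.0 * v / total, ndigits) for k, v in sorted(counts.items())}

# Hypothetical toy records, only to show the computation.
ages_months = [6, 18, 30, 42, 60]
genders = ["Male", "Female", "Male", "Male", "Female"]

print(summarize_numeric(ages_months))
print(category_percentages(genders))
```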

Comment 4: The font size is to be enlarged in Fig.9, and the actual numbers of confusion matrix are to be provided

Response 4: Thank you for your comment. Figure 9 has been updated as requested.

Comment 5: Line 210: "A mentioned" should read "As mentioned". Line 269: As there are only 2 classes, I think it is worth describing the balancing more precisely.

Response 5: Thank you for your comment. The paragraph at the same location has been updated as follows to provide more details about data balancing.

“To address class imbalance in the dataset, we employed the Synthetic Minority Over-sampling Technique (SMOTE), a method grounded in the k-nearest neighbors (KNN) algorithm [46]. For each minority-class sample, SMOTE identifies its k nearest minority-class neighbors (k = 5 by default) and creates new synthetic instances at random points along the line segments connecting the sample to these neighbors. This approach enriches the representation of the minority class, allowing the model to learn more effectively from its feature space and improving the decision boundary between classes. Additionally, our processing pipeline included data cleaning to remove inconsistencies, feature extraction to highlight relevant attributes, and normalization to ensure proper scaling. This preparation not only balanced the dataset but also facilitated robust analysis and model training, ultimately improving predictive performance and reliability.”
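The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard SMOTE procedure, not the authors' code; the function name `smote` and its parameters are hypothetical. In practice, a maintained implementation such as the `SMOTE` class of the imbalanced-learn library (whose `k_neighbors` parameter also defaults to 5) is usually preferable.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)  # cannot use more neighbours than other minority samples

    # Pairwise Euclidean distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)             # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                         # base minority sample
        nb = X_min[neighbours[i, rng.integers(k)]]  # one of its k nearest neighbours
        gap = rng.random()                          # random position in [0, 1)
        synthetic[j] = X_min[i] + gap * (nb - X_min[i])
    return synthetic

# Usage on hypothetical 2-D minority-class features.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
X_new = smote(X_min, n_new=10, k=5, seed=0)
print(X_new.shape)  # (10, 2)
```

Because every synthetic point is a convex combination of two real minority samples, the new data stay inside the region already occupied by the minority class rather than duplicating existing points.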

Comment 6: line 338 Methods should be section number 3

Response 6: Thank you very much for your comment. The Methods section has been renumbered accordingly.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

From the point of view of the methodology of planning and presenting scientific results, the article is suitable for publication. The authors extracted the features of the images (using 2 representative samples of the data) and used appropriate methods to identify those features that have the greatest potential as inputs to binary image classifiers.

However, the distribution of data belonging to two different classes, obtained by image preprocessing and feature extraction, shown in Figure 5, presents two almost completely linearly separable sets (regardless of the choice of data source: Palmar images, Eye conjunctiva images and Fingernail images).

With this kind of distribution of data in classes, it is impossible, in the opinion of the reviewer, to demonstrate the superiority of the RexNet model over the LSTM model. Both approaches produce the same ideal classification effect, as evidenced by the ROC curves shown in Figure 8 and the values of the area under the curve, which are practically equal to 1 (I consider the value of 0.99994 in Figure 8b to be equal to 1).

Therefore, I believe that the usefulness of the RexNet approach has not been proved by the Authors, which translates into my low rating in the categories: Scientific Soundness and Interest to Readers.
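The reviewer's argument can be made concrete with the rank-based (Mann-Whitney) form of the AUC, which equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The minimal Python sketch below uses hypothetical scores for two models: once every positive score exceeds every negative score, both models reach an AUC of exactly 1, so the AUC alone cannot rank them.

```python
def auc_rank(scores, labels):
    """AUC as P(random positive outranks random negative); ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1, 1]
model_a = [0.10, 0.20, 0.30, 0.70, 0.80, 0.90]  # hypothetical, widely separated
model_b = [0.40, 0.45, 0.49, 0.51, 0.55, 0.60]  # hypothetical, barely separated
print(auc_rank(model_a, labels), auc_rank(model_b, labels))  # 1.0 1.0
```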

Please change the descriptions in Fig. to "Dropout layer".

Author Response

Comment 1: From the point of view of the methodology of planning and presenting scientific results, the article is suitable for publication. The authors extracted the features of the images (using 2 representative samples of the data) and used appropriate methods to identify those features that have the greatest potential as inputs to binary image classifiers. However, the distribution of data belonging to two different classes, obtained by image preprocessing and feature extraction, shown in Figure 5, presents two almost completely linearly separable sets (regardless of the choice of data source: Palmar images, Eye conjunctiva images and Fingernail images). With this kind of distribution of data in classes, it is impossible, in the opinion of the reviewer, to demonstrate the superiority of the RexNet model over the LSTM model. Both approaches produce the same ideal classification effect, as evidenced by the ROC curves shown in Figure 8 and the values of the area under the curve, which are practically equal to 1 (I consider the value of 0.99994 in Figure 8b to be equal to 1). Therefore, I believe that the usefulness of the RexNet approach has not been proved by the Authors, which translates into my low rating in the categories: Scientific Soundness and Interest to Readers.

Response 1: Thank you for your comment. In response to the reviewer's concerns, I would like to clarify that the evaluation of the RexNet model involved multiple performance metrics, not solely the area under the curve (AUC). These metrics capture aspects of model performance beyond accuracy alone. Such a multifaceted evaluation is important, particularly when assessing machine learning models on small-scale datasets and comparing them with traditional deep learning architectures. The scientific contribution of RexNet extends beyond its immediate accuracy; it aims to address future challenges involving more complex data. My primary goal is to show that there is a clear difference in learning behavior, most visible in the loss curves, where RexNet converges more smoothly and faster than the traditional methods. This suggests that RexNet's architecture is better equipped to handle varying data complexity and offers a promising foundation for future research on image classification.

Comment 2: Please change the descriptions in Fig. to "Dropout layer".

Response 2: Thank you for your comment. The figure is revised and updated accordingly.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Anemia in children is diagnosed from images of the eye sclera, palm, and fingernails. Image features are extracted by traditional methods and then passed to a rather complex neural network based on the LSTM approach.

There are the following drawbacks.

1. Rationale. It is not clear why the author has chosen LSTM for the study. This architecture is primarily targeted at discovering patterns in time series. The nature of a time series is that the input data are not equivalent: the next data depend on the previous data, but the previous data do not depend on the next. But the study does not involve time series; it uses images. In images, all pixels are equivalent in a causal sense. Of course, an LSTM model can process images, but this is not optimal. The author should provide a clear explanation of why LSTM is preferred and how the image data are transformed into time-like series.

2. Mathematics. Formula (17) is described as a normalization to the [0, 1] range. This is clearly not the case, since it can yield negative values.

3. Sometimes images are denoted as I(a,b) with two indices, sometimes as P_i with a single index i, and sometimes as \mu(t) with index t. This notation should be unified.

4. The procedure of feature selection is not clear. What procedure was used to select the features from the initial set of 181? What were the basic assumptions?

5. Presentation. Figure 5 gives the number of features in percent. This is not informative. The number of features should be presented directly.

6. Presentation. Figure 8 is not informative. No difference can be seen between the ROC curves on a linear scale. A logarithmic scale should be used to reveal the difference.

7. Presentation. Figure 9 is not informative. It is good practice to draw confusion matrices in color to give a qualitative picture if the number of classes is a dozen or more. For two classes it is pointless, since full understanding can be achieved from four numbers. The four numbers in each confusion matrix should be given, perhaps with a colored background.

8. Table 3 reports a time. But what is this time for? Is it the time for processing a single image?

9. The size of the networks (number of trainable weights) should be given.

10. The term SMOTE on page 6 should be defined before its first use.

Author Response

Comment 1: Anemia in children is diagnosed from images of the eye sclera, palm, and fingernails. Image features are extracted by traditional methods and then passed to a rather complex neural network based on the LSTM approach. There are the following drawbacks.

Response 1: Thank you for your feedback. I have carefully addressed all comments point by point, and the changes are highlighted in red in the revised manuscript.

Comment 2: It is not clear why the author has chosen LSTM for the study. This architecture is primarily targeted at discovering patterns in time series. The nature of a time series is that the input data are not equivalent: the next data depend on the previous data, but the previous data do not depend on the next. But the study does not involve time series; it uses images. In images, all pixels are equivalent in a causal sense. Of course, an LSTM model can process images, but this is not optimal. The author should provide a clear explanation of why LSTM is preferred and how the image data are transformed into time-like series.

Response 2: Thank you for your comment. The following statement has been inserted in the Methods section to clarify this issue.

“As previously mentioned, this work adopts RexNet, which builds upon both the LSTM network and recurrent expansion rules [24,49]. While LSTM networks are traditionally designed for time-series data, they were chosen here for their ability to capture long-term dependencies and complex patterns. In this study, the image data are first preprocessed to extract critical features, such as texture and shape, which are then arranged into sequences that the LSTM can efficiently process. Each feature vector is treated as a step in a sequence, allowing the LSTM to learn relationships between different aspects of the image. This approach enables the model to capture intricate patterns and correlations within the features extracted from the images, improving the diagnosis of anemia in young children. Furthermore, by integrating these extracted image features with other factors such as age, the model leverages LSTM's strength in handling sequential data, providing a richer and more accurate diagnostic capability.”
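To make the "features as a sequence" idea concrete, the NumPy sketch below runs a standard single-layer LSTM cell over a feature vector, reading one scalar feature per time step. This is one possible reading of the description above, not the RexNet implementation: the random weights, the hidden size of 8, the gate ordering, and the 181-step sequence (the feature count before selection) are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_hidden(seq, W, U, b):
    """Run a single-layer LSTM over seq (T x d) and return the final hidden state.

    W: (4h, d) input weights, U: (4h, h) recurrent weights, b: (4h,) bias,
    stacked in the gate order [input, forget, candidate, output].
    """
    hidden = U.shape[1]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in seq:                               # one sequence step per feature
        z = W @ x_t + U @ h + b
        i = sigmoid(z[0:hidden])                  # input gate
        f = sigmoid(z[hidden:2 * hidden])         # forget gate
        g = np.tanh(z[2 * hidden:3 * hidden])     # candidate cell state
        o = sigmoid(z[3 * hidden:4 * hidden])     # output gate
        c = f * c + i * g                         # cell state carries long-term memory
        h = o * np.tanh(c)
    return h

rng = np.random.default_rng(0)
features = rng.standard_normal(181)   # hypothetical stand-in for one image's features
seq = features.reshape(181, 1)        # T = 181 steps, one feature per step
hidden = 8
W = 0.1 * rng.standard_normal((4 * hidden, 1))
U = 0.1 * rng.standard_normal((4 * hidden, hidden))
b = np.zeros(4 * hidden)
h_final = lstm_last_hidden(seq, W, U, b)
print(h_final.shape)  # (8,)
```

Note that under this reading the "order" of the features is fixed by the extraction pipeline rather than by time, which is exactly the point the reviewer raises.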

Comment 3: Formula (17) is described as a normalization to the [0, 1] range. This is clearly not the case, since it can yield negative values.

Response 3: Thank you for your comment. This equation has been corrected accordingly.
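Since the revised equation (17) is not shown here, the sketch below assumes the standard min-max scaling x' = (x - min(x)) / (max(x) - min(x)), which does map values to [0, 1], and contrasts it with z-score standardization, which is unbounded and can produce the negative values the reviewer pointed out. Whether the corrected equation takes exactly this form is an assumption.

```python
import numpy as np

def min_max_normalize(x, eps=1e-12):
    """Scale values linearly to [0, 1]: (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / max(hi - lo, eps)  # eps guards against a constant input

def z_score(x):
    """Zero-mean, unit-variance scaling; unbounded, so values can be negative."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

x = np.array([2.0, 4.0, 6.0, 10.0])
print(min_max_normalize(x).tolist())  # [0.0, 0.25, 0.5, 1.0]
print(z_score(x).min() < 0)           # True
```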

Comment 4: Sometimes images are denoted as I(a,b) with two indices, sometimes as P_i with a single index i, and sometimes as \mu(t) with index t. This notation should be unified.

Response 4: I used different notations because the image passes through various processing stages, such as segmentation, scaling, and binarization. Logically, it is no longer the same image after these transformations. By the time it reaches the stage that represents the extracted features, it has been converted into a numerical representation and is no longer an image in the traditional sense.

Comment 5: The procedure of feature selection is not clear. What procedure was used to select the features from the initial set of 181? What were the basic assumptions?

Response 5: Thank you for your comment.

Comment 6: Figure 5 gives the number of features in percent. This is not informative. The number of features should be presented directly.

Response 6: Thank you for your comment. To clarify, Figure 5 does not show the number of features; it is a scatter plot illustrating the class distributions. The figure is intended to show how the data are distributed between the classes, not the number of features used. Displaying the exact number of features in this type of scatter plot would be difficult and would not add clarity. We believe the reviewer may have been referring to Figure 4, which has now been updated accordingly.

Comment 7: Figure 8 is not informative. No difference can be seen between the ROC curves on a linear scale. A logarithmic scale should be used to reveal the difference.

Response 7: Thank you for your comment. In response, I have added a subplot to Figure 8 showing a zoomed-in view of the ROC curve for the Palm dataset. This addition illustrates the differences in ROC performance more clearly and highlights the variations between the LSTM and RexNet models.

Comment 8: Figure 9 is not informative. It is good practice to draw confusion matrices in color to give a qualitative picture if the number of classes is a dozen or more. For two classes it is pointless, since full understanding can be achieved from four numbers. The four numbers in each confusion matrix should be given, perhaps with a colored background.

Response 8: Thank you for your feedback. I have updated Figure 9 by adding numerical values to the confusion matrices while maintaining the original color scheme. This enhancement provides clearer insights into the classification results.
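Following the reviewer's point that four numbers fully describe a binary confusion matrix, the Python sketch below derives those counts and the usual summary metrics from label vectors. Treating anemic as the positive class (label 1) is an assumption, and the labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a binary classification problem."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def binary_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 computed from the four counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical labels (1 = anemic)
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)                   # (3, 1, 1, 3)
print(binary_metrics(*counts))
```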

Comment 9: Table 3 reports a time. But what is this time for? Is it the time for processing a single image?

Response 9: Thank you for your comment. I have improved Table 4 to clearly display the training time in seconds. This enhancement provides a more precise understanding of the training duration.

Comment 10: The size of the networks (number of trainable weights) should be given.

Response 10: Thank you for your comment. We have added the following paragraph to the Results section to clarify this issue.

“Table 6 presents an analysis of network sizes, listing the trainable parameters of both LSTM and RexNet across the three medical imaging datasets used for anemia diagnosis. For the Eye dataset, RexNet has considerably more parameters (21,802) than LSTM (5,698), which may reflect its capacity to capture complex features in medical images. Conversely, for the Fingernails dataset, LSTM has more trainable weights (29,642) than RexNet (12,451), suggesting that it may be better suited to that specific dataset. The Palm dataset follows a similar trend, with LSTM again having more parameters (42,552) than RexNet (17,875). Overall, RexNet's ability to extract intricate features, together with its lower parameter count on two of the three datasets, makes it an attractive option for medical image analysis. Therefore, RexNet is recommended for anemia diagnosis, especially where image complexity is high, as it can enhance diagnostic accuracy while potentially reducing training time and mitigating the risk of overfitting.”

Table 6. Network sizes (trainable parameters).

Dataset	Algorithm	Trainable weights
Eye	LSTM	5698
Eye	RexNet	21802
Fingernails	LSTM	29642
Fingernails	RexNet	12451
Palm	LSTM	42552
Palm	RexNet	17875
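To relate Table 6 to architecture choices, the trainable-parameter count of a standard single-layer LSTM with input size d and hidden size h is 4(h(d + h) + h): four gates, each with input weights, recurrent weights, and one bias vector (PyTorch's `nn.LSTM` keeps two bias vectors per gate, adding another 4h). A dense output layer with c classes adds hc + c. The layer sizes behind the values in Table 6 are not reported here, so the sizes below are hypothetical and are not meant to reproduce the table.

```python
def lstm_params(d, h, bias_vectors=1):
    """Trainable parameters of one LSTM layer (input, forget, cell, output gates)."""
    return 4 * (h * d + h * h + bias_vectors * h)

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Hypothetical sizes: 1 input feature per step, 8 hidden units, 2 output classes.
total = lstm_params(d=1, h=8) + dense_params(8, 2)
print(total)  # 338
```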

Comment 11: The term SMOTE on page 6 should be defined before its first use.

Response 11: Thank you for your comment. The term "SMOTE" has been defined at its first use as requested.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The article is interesting and presents a real problem that is approached with an interesting methodology and, in addition, good results are obtained.

One of the problems with the article is that it is not clear what advantage imaging and image processing offer for detecting anemia when a simple blood test would be more effective and error-free. The problems noted, such as discomfort, risk of infection, and difficulty of frequent monitoring, are not real problems in a global context (although they may be in certain countries and settings).

The following are some of the errors detected and some considerations on the results obtained:

* In Figure 2 (Flowchart), "...resuling in 181 features" appears; it must be "...resulting in 181 features".

* Line 210 starts with "A mentioned" and should be "As mentioned".

* In line 241 it is said that "l²_a and l²_b are the length of the semi-major axis" when, in fact, they are squared lengths.

* In line 257, "Contarst" appears instead of "Contrast".

* In equations containing a summation, the limits are not defined. They should be defined to avoid confusion.

* The feature extraction is not clear, since there is no way to know each of the 181 components obtained. The extraction process should be clarified.

* As the authors rightly say, there is a problem with the imbalance in the data for the palmar and nail images. Although an attempt is made to reduce the problem, the author himself comments that the results are not entirely realistic. The solution is to obtain balanced data, i.e., more images.

* The name RexNet can be confusing because, if you search the Internet, you will find another RexNet (Rank Expansion Networks), which can be consulted at https://arxiv.org/abs/2007.00992.

* In Figure 6, "Drpout layer" appears and should be "Dropout layer".

* Figures 8 and 9 are not very informative and could be eliminated since, visually, they show no difference between LSTM and RexNet.

* In Table 3 there are two typos: "Mathod" should be "Method" and "Plam" (eighth row) should be "Palm".

Author Response

Comment 1: The article is interesting and presents a real problem that is approached with an interesting methodology and, in addition, good results are obtained. One of the problems with the article is that it is not clear what advantage imaging and image processing offer for detecting anemia when a simple blood test would be more effective and error-free. The problems noted, such as discomfort, risk of infection, and difficulty of frequent monitoring, are not real problems in a global context (although they may be in certain countries and settings).

Response 1: Thank you for your comment. The following statement has been added to the introduction to clarify this issue.

“While traditional blood tests are the gold standard for diagnosing anemia, this study emphasizes the potential of imaging techniques as a complementary approach, particularly in contexts where access to laboratory facilities is limited or frequent blood draws pose challenges. Imaging offers unique benefits, such as non-invasiveness and the ability to capture additional physiological information, which can enhance accessibility in underserved or resource-limited environments. This research aims to highlight how imaging can serve as an adjunct to conventional methods, addressing the need for innovative diagnostic solutions in global health contexts.”

Comment 2: The following are some of the errors detected and some considerations on the results obtained: In Figure 2 (Flowchart), "...resuling in 181 features" appears; it must be "...resulting in 181 features". Line 210 starts with "A mentioned" and should be "As mentioned". In line 241 it is said that "l²_a and l²_b are the length of the semi-major axis" when, in fact, they are squared lengths. In line 257, "Contarst" appears instead of "Contrast". In equations containing a summation, the limits are not defined. They should be defined to avoid confusion.

Response 2: Thank you for your comments. I have corrected the identified errors and added a note on the summation limits, clarifying that the summations run over all pixels of the respective images. This keeps the equations compact while avoiding potential confusion.

“Note: The summations in equations (4)-(16) run over all pixels of the respective images; their limits are omitted for clarity.”
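As an illustration of what explicit limits would look like, the area and centroid of the foreground of a binary image B(x, y) of size M × N can be written as follows. The symbols B, M, and N are assumptions introduced for this example; after the 128 × 128 resizing described in the manuscript, M = N = 128.

```latex
A = \sum_{x=1}^{M} \sum_{y=1}^{N} B(x, y), \qquad
\bar{x} = \frac{1}{A} \sum_{x=1}^{M} \sum_{y=1}^{N} x \, B(x, y), \qquad
\bar{y} = \frac{1}{A} \sum_{x=1}^{M} \sum_{y=1}^{N} y \, B(x, y)
```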

Comment 3: The feature extraction is not clear, since there is no way to know each of the 181 components obtained. The extraction process should be clarified.

Response 3: Thank you for your comment. The following section of the text is dedicated to explaining the feature extraction process.

“As mentioned, the initial step involved cleaning the datasets by removing augmented images and retaining only the original ones. This ensured that the dataset contained only authentic samples for analysis. Additionally, filenames were modified to include the relevant labels, based on the information and Excel sheets provided with the datasets, to streamline the processing workflow. This preparation phase allowed all images to be handled through a single, consistent pipeline. Next, feature extraction was a critical phase of the processing pipeline, involving a series of systematic preprocessing steps for each image. First, images were resized to a uniform dimension of 128 × 128 pixels to ensure consistent input sizes for the subsequent processing steps. This standardization is essential for reliable feature extraction and analysis. The images were then converted to grayscale to simplify the data and normalized to zero mean and unit variance [30], which standardizes the scale of pixel values across all images. Segmentation followed, in which the normalized images were binarized using a global threshold determined by Otsu's method [38]. This step isolates the regions of interest by converting the images into binary form based on the computed threshold. Subsequently, the segmented images were analyzed to extract region properties such as area, centroid, eccentricity, and solidity [38]. These properties provide essential information about the shapes and distributions of features within the segmented regions. In the next step, the images were converted to the HSV and LAB (lightness, a, b) color spaces to extract detailed color features [39,40]. Statistical moments, including the mean, standard deviation, and skewness of each color channel, were calculated to quantify the color characteristics of the images.
For texture analysis, features were extracted using the GLCM and LBP [41–43]. GLCM features such as contrast, correlation, energy, and homogeneity were used to describe the texture of the images, while LBP features provided additional texture information based on local patterns. The equations that define these processes are given in (1)-(16). In these equations, the symbols denote, respectively: the normalized and grayscale images; the mean and standard deviation of the grayscale image; the threshold; the weight of the class below the threshold, the mean intensity of the class below the threshold, and the global mean intensity of the image; the coordinates of pixels in the image; the area of the region; the centroid coordinates along x and y; the eccentricity; the squared lengths of the semi-major and semi-minor axes of the ellipse that has the same second moments as the region; the solidity and the area of the convex hull surrounding the region; the extracted color-moment statistics; the pixel value; and the local binary features and the indices of the neighborhood pixels. Note: The summations in equations (4)-(16) run over all pixels of the respective images; their limits are omitted for clarity.

Equations (1)-(16)”
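Two of the steps quoted above, Otsu thresholding and LBP texture coding, can be sketched with NumPy alone. The Otsu function uses the quantities defined in the passage (the weight of the class below the threshold omega0, the mean intensity of that class mu0, and the global mean intensity mu_T) through the between-class variance omega0 * (mu0 - mu_T)^2 / (1 - omega0); the LBP function uses the basic 8-neighbour, 3 × 3 variant. Both are minimal illustrations under these assumptions, not the authors' implementation, and the function names and test image are hypothetical. Maintained versions exist in scikit-image (`skimage.filters.threshold_otsu`, `skimage.feature.local_binary_pattern`).

```python
import numpy as np

def zscore(gray):
    """Zero-mean, unit-variance normalization of a grayscale image."""
    g = np.asarray(gray, dtype=float)
    return (g - g.mean()) / (g.std() + 1e-12)

def otsu_threshold(img, bins=256):
    """Global threshold maximizing the between-class variance (Otsu's method)."""
    hist, edges = np.histogram(np.ravel(img), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    n = hist.sum()
    c0 = np.cumsum(hist)                # pixels in the class below each candidate cut
    c1 = n - c0                         # pixels above it (exact integer arithmetic)
    omega0 = c0 / n                     # weight of the class below the threshold
    mu0 = np.cumsum(hist * centers) / np.maximum(c0, 1)  # mean of that class
    mu_T = (hist * centers).sum() / n                    # global mean intensity
    valid = (c0 > 0) & (c1 > 0)         # both classes must be non-empty
    sigma_b2 = np.zeros(bins)
    sigma_b2[valid] = omega0[valid] * (mu0[valid] - mu_T) ** 2 / (c1[valid] / n)
    k = int(np.argmax(sigma_b2))
    return edges[k + 1]                 # pixels >= this value form the foreground

def lbp_histogram(gray):
    """Normalized 256-bin histogram of basic 3x3 (8-neighbour) LBP codes."""
    g = np.asarray(gray, dtype=float)
    H, W = g.shape
    center = g[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        codes += (neighbour >= center).astype(np.int64) << bit  # set bit if neighbour >= centre
    return np.bincount(codes.ravel(), minlength=256) / codes.size

# Hypothetical bimodal image: dark background with a brighter square.
rng = np.random.default_rng(0)
img = rng.normal(0.2, 0.05, size=(128, 128))
img[32:96, 32:96] = rng.normal(0.8, 0.05, size=(64, 64))
norm = zscore(img)
mask = norm >= otsu_threshold(norm)   # binary segmentation
print(mask.sum())                     # foreground pixel count
print(lbp_histogram(img).shape)       # (256,)
```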

Comment 4: As the authors rightly say, there is a problem with the imbalance in the data referring to the palmar and nail images. Although an attempt is made to reduce the problem, the author himself comments that the results are not entirely realistic. The solution is to obtain balanced data, i.e., more images.

Response 4: Thank you for highlighting this important issue regarding data imbalance in our study, particularly concerning the palmar and nail images. We recognize that while we made efforts to mitigate this problem, the results still reflect some limitations. As you pointed out, obtaining a more balanced dataset with a greater number of images will be crucial for enhancing the reliability and realism of our findings. We plan to focus on this aspect in future work to improve the robustness of our analyses and conclusions.

Comment 5: The name RexNet can be confusing because, if you search the Internet, you will find another RexNet (Rank Expansion Networks), which can be consulted at https://arxiv.org/abs/2007.00992.

Response 5: Thank you for your comment. We appreciate your observation regarding the potential for confusion with the RexNet name. While we acknowledge that other models have similar abbreviations, we believe this is acceptable as long as our paper clearly defines the term. The choice to retain the name "RexNet" is intentional, as it aligns with the original work on recurrent expansion rules that we are building upon. We have made sure that this connection is explicit in the text to avoid any ambiguity.

Please see Section 3: “…As previously mentioned, this work adopts RexNet, which builds upon both the LSTM network and recurrent expansion rules [24,49]…”

Comment 6: In figure 6 appears a "Drpout layer" that should be "Dropout layer"

Response 6: Thank you for your comment. This error has been corrected and figure is updated.

Comment 7: Figures 8 and 9 are not very informative and could be eliminated since, visually, they show no difference between LSTM and RexNet.

Response 7: Thank you for your comment. We appreciate your feedback regarding Figures 8 and 9. We have invested considerable effort in these figures, not only for their aesthetic appeal but also to ensure they convey meaningful information. To enhance their informative value, we have included zoomed-in sections in Figure 8 and added numerical reference points in both figures.

Comment 8: In table 3 there are two typos. "Mathod" should be "Method" and "Plam" (eighth row) should be "Palm".

Response 8: Thank you for your comments. These typos are corrected accordingly.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

The author has corrected the paper according to the comments. Now the paper can be published.