Article
Peer-Review Record

Entity Matching by Pool-Based Active Learning

Electronics 2024, 13(3), 559; https://doi.org/10.3390/electronics13030559
by Youfang Han and Chunping Li *
Reviewer 1: Anonymous
Reviewer 3: Anonymous
Submission received: 18 December 2023 / Revised: 23 January 2024 / Accepted: 29 January 2024 / Published: 30 January 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

In this manuscript, the authors presented a comprehensive study on using active learning and query strategies to reduce the manual labeling required for entity matching problems while maintaining performance. The paper is well structured, and the methods and results are well presented. Overall, this work stands out as an intriguing, innovative, and significant contribution to the community. Nevertheless, there are a few questions and comments that need addressing before a recommendation for publication.

1. The manuscript lacks explicit details regarding model training and validation. Although the method is quite clear, crucial procedures for training the model and obtaining results are omitted. To enhance accuracy and reproducibility, it is recommended that the authors provide more details on model training and validation.

2. Clarification is needed regarding the fairness comparison between the current model and other models. The process by which the authors obtained experimental results for both the current and other models remains unclear. For instance, the authors introduce a new method for modifying training, test, and validation datasets. It is ambiguous whether all models utilized the same training dataset. The description suggests potential discrepancies, implying that other models may have employed different training datasets. I would suggest the authors to provide a more robust discussion when comparing model performances.

3. The manuscript highlights superior performance on small datasets. However, achieving high performance on such datasets can be influenced by the data distribution in training and validation sets. The high performance of the proposed model may stem from a similar distribution between these sets, implying that data in the validation set might also be seen in the training set. To substantiate the origin of this high performance, I recommend the authors to conduct more experiments, ensuring a stable model performance for small datasets.

Author Response

Thank you very much for your time and effort in the review process. We have addressed the questions and comments and revised our paper accordingly; the revised parts are highlighted in yellow in the paper. Here, we explain how each of the points has been addressed.

  • Comment: "The manuscript lacks explicit details regarding model training and validation. Although the method is quite clear, crucial procedures for training the model and obtaining results are omitted. To enhance accuracy and reproducibility, it is recommended that the authors provide more details on model training and validation."

Response: Thanks for the comment and suggestion. In the revised version, we have added more details on the algorithm and the hyperparameter settings in Section 4.1 in order to facilitate reproduction. It is worth mentioning that this paper mainly focuses on the methodological design of active learning. The machine learning models within the active learning framework are merely employed to calculate the uncertainty of samples for labeling, where base classifiers (sourced from the Python scikit-learn package) are used for training and validation.
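
For illustration, the following is a minimal, hypothetical sketch of pool-based active learning with a least-confidence query; the classifier, feature matrices, batch size, and oracle callback are placeholder assumptions and not the exact configuration used in the paper.

```python
# A hedged sketch of pool-based active learning with least-confidence sampling.
# X_labeled/y_labeled: the initial labeled pool; X_pool: unlabeled candidates;
# oracle(samples): assumed callback returning true labels (the human expert).
import numpy as np
from sklearn.linear_model import LogisticRegression

def pool_based_active_learning(X_labeled, y_labeled, X_pool, oracle,
                               n_rounds=10, batch_size=20):
    clf = LogisticRegression(max_iter=1000, random_state=0)
    for _ in range(n_rounds):
        clf.fit(X_labeled, y_labeled)
        proba = clf.predict_proba(X_pool)
        # Least-confidence uncertainty: a low maximum class probability means
        # the model is unsure, so these samples are the most valuable to label.
        uncertainty = 1.0 - proba.max(axis=1)
        query_idx = np.argsort(uncertainty)[-batch_size:]
        # Ask the oracle for labels and move the samples into the labeled pool.
        X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, oracle(X_pool[query_idx])])
        X_pool = np.delete(X_pool, query_idx, axis=0)
    return clf
```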

  • Comment: "Clarification is needed regarding the fairness comparison between the current model and other models. The process by which the authors obtained experimental results for both the current and other models remains unclear. For instance, the authors introduce a new method for modifying training, test, and validation datasets. It is ambiguous whether all models utilized the same training dataset. The description suggests potential discrepancies, implying that other models may have employed different training datasets. I would suggest the authors to provide a more robust discussion when comparing model performances."

Response: Thanks for the comment and suggestion. In this paper, all experiments are conducted on open public datasets that have already been partitioned into standard training, validation, and test sets for comparative studies in this research field. The related works mentioned in the references were also evaluated on the same partitioned sets. In our work, we do not modify the validation set or the test set. For training, we only select a smaller number of samples from the original training set via pruning strategies and the active learning method. We believe selecting fewer samples from the training set (rather than all of it) is reasonable, as it highlights the advantage of our method: the experimental results demonstrate that our method can use fewer labeled samples for training yet achieve higher matching performance. Fairness is not affected as long as the test set and validation set remain the same. Following your suggestion, we have added more explanations in Section 4.1.1 in the revised version. Moreover, we have added an additional experiment comparing against other active learning methods, in which the baseline methods individually select samples for labeling from the same training set and are evaluated on the same test set. This comparative analysis further verifies the effectiveness of our method.

  • Comment: "The manuscript highlights superior performance on small datasets. However, achieving high performance on such datasets can be influenced by the data distribution in training and validation sets. The high performance of the proposed model may stem from a similar distribution between these sets, implying that data in the validation set might also be seen in the training set. To substantiate the origin of this high performance, I recommend the authors to conduct more experiments, ensuring a stable model performance for small datasets."

Response: Thanks for the comment and suggestion. Following your suggestion, we have added a stability experiment in Section 4.3.6. By randomly partitioning the datasets, we repeat the experiment 10 times to demonstrate the stability of our method.
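
As a rough illustration of what such a stability check can look like (the dataset, split ratio, and classifier below are placeholders, not the paper's setup):

```python
# Hypothetical stability check: repeat the evaluation over 10 random
# train/test partitions and report the mean and standard deviation of F1.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def stability_check(X, y, n_repeats=10):
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed)
        clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
        scores.append(f1_score(y_te, clf.predict(X_te)))
    return float(np.mean(scores)), float(np.std(scores))
```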

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

Thank you for submitting your work to the Electronics-MDPI journal.

In short, your work is an active learning method for the entity-matching task. It labels a small number of valuable samples and employs them to build a model with high quality.

1. Introduction

At the end of the introduction, you should add a paragraph, which represents a manuscript map, that briefly illustrates what each section in your manuscript will discuss.

2. Related Works

- Your related work section lacks the most recent works (i.e. from 2023 and 2022); you should consider some works published in these years.

2.1. The Framework of Pool based Active Learning

- How did you treat the variety of experts' opinions during sample labelling? Are there any criteria followed by experts for labelling?
- Please enhance the quality of Figure 2.

2.2. Data Preprocessing

- Please add an introductory paragraph to show what this subsection will discuss.

2.3. Experiment Setup

- This subsection should have an introductory paragraph to show what this section will discuss.

2.3.1. Data Sets

- Can you please explain what you did with labels that have multiple matches? Or partly matching?

2.4. Evaluation Metrics

- It would be better if you add more evaluation metrics to assess your work from different perspectives.

References:

Your work is lacking recent works, i.e. from 2023 and 2022, which should be added.

Thank you

Author Response

Thank you very much for your time and effort in the review process. We have addressed the questions and comments and revised our paper accordingly; the revised parts are highlighted in yellow in the paper. Here, we explain how each of the points has been addressed.

  • Comment: "Section 1. At the end of the introduction, you should add a paragraph, which represents a manuscript map, that briefly illustrates what each section in your manuscript will discuss."

Response: Thanks for the suggestion. We have added a brief outline of the manuscript structure at the end of the Introduction.

  • Comment: "Section 2. Your related work section lacks the most recent works (i.e. from 2023 and 2022); you should consider some works published in these years."

Response: Thanks for the suggestion. We have added a survey of the most recent works, together with the corresponding references, in the revised version.

  • Comment: "Section 2.1. How did you treat the variety of experts' opinions during sample labelling? Are there any criteria followed by experts for labelling? Please enhance the quality of Figure 2."

Response: Thanks for the comment and suggestion. Our proposed method is mainly designed as a human-machine interaction, in which uncertain samples are selected automatically and continuously, and experts then annotate the labels of these samples for iterative training. We assume that the sample labels annotated by the experts are ground truth. We have added further explanations in Section 3.1.

In the revised version, we have also enhanced the quality of Figure 2.

  • Comment: "Section 2.2. Please add an introductory paragraph to show what this subsection will discuss. Section 2.3. This subsection should have an introductory paragraph to show what this section will discuss."

Response: Thanks for the suggestion. We have added an introductory paragraph to Section 2.2 and Section 2.3, respectively.

  • Comment: "Section 2.3.1. Can you please explain what you did with labels that have multiple matches? Or partly matching?"

Response: Thanks for the suggestions. Our method can handle multiple matches. When constructing the sample sets, the data from different data sources are paired one by one, so multiple matches may exist among the samples (see the pairing sketch after this response). Our method merely needs to determine whether each pair belongs to the same entity, so whether a record is matched multiple times does not affect the judgment of the model in entity matching tasks.

Currently, our method cannot handle partial matching, because partial matches lead to lower similarity values in the algorithm, making the algorithm more inclined to judge the pair as mismatched. We have added further explanation and discussion of the applicability of our algorithm in the revised version.
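
For illustration only, here is a minimal hypothetical sketch of the pairing step described above, in which every record from one source is paired with every record from the other source and each pair becomes one candidate sample:

```python
# Hypothetical sketch of candidate-pair construction for entity matching:
# every record from source A is paired with every record from source B,
# so one record can naturally take part in several candidate pairs
# (which is how multiple matches arise in the sample sets).
from itertools import product

def build_candidate_pairs(records_a, records_b):
    """Return all cross-source record pairs as candidate samples."""
    return [(a, b) for a, b in product(records_a, records_b)]

# Example usage with toy records:
pairs = build_candidate_pairs(
    [{"name": "ACME Inc."}, {"name": "Globex Corp."}],
    [{"name": "ACME Incorporated"}, {"name": "Initech"}],
)
# Each pair is then labeled 1 (same entity) or 0 (different entities).
```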

  • Comment: "Section 2.4. It would be better if you add more evaluation metrics to assess your work from different perspectives."

Response: Thanks for the suggestions. For entity matching tasks, the commonly used evaluation metric is the F1 score, and the related references also use only the F1 score. In particular, reference [44] discusses in detail why the F1 score is the most suitable metric for entity matching tasks.

As our method focuses on using a small number of labeled samples to reduce the labeling workload, in the revised version we add a further experiment comparing the number of labeled samples used. The experimental results further demonstrate that our method achieves better performance with less labeled data.

Reviewer 3 Report

Comments and Suggestions for Authors

There is a fundamental issue with building the ML models. The authors indicated that the parameters and random seed of the classifiers are consistent, which removes randomness from the building of the ML models. Therefore, the experiment outcomes are wrong. I am rejecting the paper based on this shortcoming. More details are in the attachment.

Comments for author File: Comments.pdf

Comments on the Quality of English Language

Acceptable.

Author Response

Thank you very much for your time and effort in the review process. We have addressed the questions and comments and revised our paper accordingly; the revised parts are highlighted in yellow in the paper. Here, we explain how each of the points has been addressed.

  • Comment: "Related work section should contain a table contrasting the surveyed related work against the paper contributions."

Response: Thanks for the suggestion. We survey the related works in the form of a paragraph-style description, as is done in many of the references cited in our literature study, and we keep this writing style in the revised version. Following your suggestion, we have added more remarks on current existing approaches and contrasted their contributions with ours, in order to underline the characteristics of our method.

  • Comment: "The experiment structure was difficult to follow. I didn’t see any research questions to be answered in this work."

Response: Thanks for the comment. Generally speaking, our proposed method significantly reduces the workload of labeling samples, which addresses the difficulty of obtaining labeled data for entity matching tasks. In the revised version, we have added a new discussion section to further explain the current issues in entity matching tasks and how our method addresses them.

  • Comment: "There is a fundamental issue with building the ML models. The authors indicated that the parameters and random seed of the classifiers are consistent, which removes randomness from the building of the ML models."

Response: Thanks for the comment. The main idea of our method is the design of uncertainty strategies for active learning. The ML models are merely employed to support the calculation of the uncertainty of unlabeled samples: after the labeled data pool is updated in each iteration, the ML models are retrained and used to compute the uncertainty of the remaining unlabeled data. For this purpose, the ML models used during active learning should have as little randomness as possible, so we fix the hyperparameters and the random seeds to ensure the stability of the active learning process. In the revised version, we have added explanations in Section 4.1.2.
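
A minimal sketch of this kind of fixed configuration is shown below; the specific classifiers, hyperparameter values, and seed are illustrative assumptions, not the exact settings in the paper.

```python
# Hypothetical committee of base classifiers with fixed hyperparameters and,
# where the estimator is stochastic, a fixed random seed, so that repeated
# active learning runs remain comparable.
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

SEED = 42  # assumed value, used only for illustration

def build_committee():
    return [
        SVC(probability=True, random_state=SEED),
        RandomForestClassifier(n_estimators=100, random_state=SEED),
        KNeighborsClassifier(n_neighbors=5),  # deterministic given the data
        GaussianNB(),                         # deterministic given the data
    ]
```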

  • Comment: "Have you done any Hyper parameters tuning? In section 4.1.3, authors mentioned the value of the initial pool size without any justifications. What about the models HP tuning?"

Response: Thanks for the comment. In the paper, we use the validation set for fine-tuning; we mainly adjust the hyperparameters of the classifiers to ensure the entity matching performance during active learning. It is actually difficult to determine the most suitable initial pool size and the number of samples added for labeling in each iteration, as they depend on the labeling effort and time budget as well as the size of the datasets in real applications; the number of added labeled samples may therefore be adjusted to suit specific settings.

  • Comment: "In your experiment details, authors mentioned that they used SVM, RF, KNN, and NB classifiers. However, there is no mention to them in the results discussion. What was the optimal classifier in each dataset?"

Response: Thanks for the comment. Similar to Response 3, our work does not focus on validating which classifiers are more suitable. The classifiers are just the means for finding the samples with the highest uncertainty for labeling, so we do not discuss which classifier is optimal for a given task.

Different classifiers may obtain different results on the datasets. The goal of our proposed method is to integrate various types of classifiers, identify the samples that are difficult for the classifiers to judge, and provide them to experts for labeling. If a sample can already be predicted well by the classifiers, there is little value in labeling it manually. Therefore, we do not need to compare which classifier is better, but rather to find the samples that are difficult for the classifiers to predict, as illustrated in the sketch below. We have given further explanations in Section 4.1 and added a new section for discussion.
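
As a hedged illustration (not the authors' exact algorithm), a committee-disagreement selection over binary match/non-match labels could look like the following, reusing the hypothetical build_committee() sketched earlier:

```python
# Hypothetical selection of hard candidate pairs by committee disagreement:
# each classifier votes on every unlabeled pair, and the pairs with the most
# evenly split votes (highest vote entropy) are handed to experts for labeling.
# Assumes binary labels where 1 = match and 0 = non-match.
import numpy as np

def select_hard_samples(committee, X_labeled, y_labeled, X_pool, batch_size=20):
    votes = np.array([clf.fit(X_labeled, y_labeled).predict(X_pool)
                      for clf in committee])        # shape: (n_classifiers, n_pool)
    p_match = votes.mean(axis=0)                    # fraction voting "match"
    eps = 1e-12                                     # avoid log(0)
    # Vote entropy peaks when the committee is split 50/50 on a pair.
    entropy = -(p_match * np.log(p_match + eps)
                + (1.0 - p_match) * np.log(1.0 - p_match + eps))
    return np.argsort(entropy)[-batch_size:]        # indices of hardest pairs
```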

  • Comment: "I see the description of the validation set usage confusing. What was it used for? HP Tuning?"

Response: Thanks for the comment. The validation set is used for fine-tuning the hyperparameters and for verifying the performance of the intermediate classifiers during the active learning process. We have added further explanations in Section 3.5 and Section 4.1.1.

  • Comment: "In Table 5, it is interesting to see the proposed method had lower F1 scores in 4 out of the 7 datasets, in comparison to DL approaches. So, why should your method be used? Just because it requires fewer label samples. "

Response: Thanks for the comment. In the revised version, we further optimize our algorithm and validate it in a supplementary experiment: we update the semi-supervised optimization strategy used during active learning, which further improves the matching performance. In the comparative study, our proposed method attains a higher F1 score with fewer labeled samples, and compared with existing active learning methods it reduces the number of labeled samples while ensuring high matching performance.

Entity matching tasks mainly face the issue of a lack of labeled samples, and manual labeling requires considerable labor and time. Our method aims to use less labeled data for modeling while ensuring the quality of the samples that are labeled, thereby reducing ineffective labeling effort while maintaining high performance in entity matching tasks. We have added more explanations in the revised version.

  • Comment: "There is no in-depth discussion on why the proposed method had higher F1 scores on the small-scale datasets."

Response: Thanks for the comment. As shown in the experimental results, the proposed method is more suitable for small-scale datasets. Large-scale datasets usually contain many more terms and data features, which makes it difficult to capture the matching patterns of all the data when only a small amount of labeled data is used. We have added a new section for further discussion and analysis.

  • Comment: "At the end of the experiment discussion, I don’t see the reported outcomes overcame the 4 challenges mentioned in the introduction. Authors should link the experiment outcomes to these challenges."

Response: Thanks for the comment. Our experimental results demonstrate that the challenges mentioned in the Introduction are addressed, i.e., the difficulty of obtaining a large number of labeled samples, the imbalanced distribution of data samples, the reliance of some special samples on expert labeling, and the reliance of deep-learning-based methods on large language models.

1. Our proposed method can use very few labeled samples to achieve accurate matching, which alleviates the problem of difficult sample acquisition.
2. The proposed method can remove a large number of mismatched samples through pruning operations. Comparing the sample distribution before and after pruning, the difference in the number of positive and negative samples becomes smaller.
3. Our method can find the samples that are difficult for the ML models to identify correctly (those with the highest uncertainty) and hand them over to humans for manual labeling.
4. Our method can, to a certain extent, overcome the constraints of deep learning methods and does not rely on domain knowledge or language models, using fewer labeled samples to achieve good performance.

Our proposed method can greatly reduce the workload of labeling and overcomes the limitations of traditional ML-based methods in entity matching, making it an efficient and relatively accurate solution for entity matching tasks. We have added more explanations in the Discussion and Conclusion sections.

  • Comment: "There should be a separate section for the threats of validity, rather than having it as a paragraph in the conclusion."

Response: Thanks for the comment. Following your suggestion, we have added a Discussion section in the revised version. In the new section, we discuss in detail which scenarios our method is applicable to, which problems it can solve, and what limitations it still faces.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors addressed all of my questions and concerns. I recommend the publication of this manuscript.

Reviewer 3 Report

Comments and Suggestions for Authors

Authors have successfully addressed my comments. I recommend the paper acceptance.
