Article
Peer-Review Record

Multi-Purpose Dataset of Webpages and Its Content Blocks: Design and Structure Validation

Appl. Sci. 2021, 11(8), 3319; https://doi.org/10.3390/app11083319
by Kiril Griazev * and Simona Ramanauskaitė
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 16 February 2021 / Revised: 29 March 2021 / Accepted: 6 April 2021 / Published: 7 April 2021

Round 1

Reviewer 1 Report

The paper describes a set of simple HTML webpages with some content. The goal is to have pages that are understandable for automated analysis.

The problem is, I do not really understand what the purpose of this paper is. All sections need to be heavily reworked to make it clearer what the contribution of this paper is and whether it is a scientific and sound contribution that warrants publication. My impression is that the content is neither scientific nor sound. However, I might err because I do not understand the paper. I propose to reject the paper, but if the authors are convinced of the scientific benefit of their work, a completely rewritten version could be submitted again.

Abstract: What is „xssss“?

Section 1: I do not understand the introduction. It appears to me that the most important sentence is the last one from line 53. Maybe it would be good to start the introduction with this sentence and then explain more what the problem is and how you help to solve this problem.

To improve understanding of the story of the paper, you could add bullet points with your concrete contributions and provide a “this paper is structured as follows” paragraph.

Section 2 (related work) does not really seem to be a related work section. Instead, it seems to be a description of the actual problem. Please separate the description of the problem you address (this is essential to understanding your paper) from the related work (that is only necessary to understand whether your work is novel).

You state (l.58): “That structure is mainly used for layout and design of the website, rather than to identify what information is being presented in different blocks.” Is that so? Please provide evidence.

Section 3: Do not describe your research in chronological order. Instead, describe your solution first. After that provide details and why you made certain design decisions.

Section 4: I do not understand what needs to be validated and I do not understand whether the validation is suited to the problem. Why

 

Language/minor issues:

Section 2 has no introductory sentence.

English language editing necessary.

Please try to improve grammar and readability of your paper by splitting up those countless very long sentences. Just one example; this applies analogously to all parts of the paper: l. 23: The landscape of information systems is evolving, amount of digital information increases daily and so does the need to collect, process and then access stored information.  -> I propose to split the sentence into three separate ones to improve grammar and make it more easily readable.

 

Author Response

The problem is, I do not really understand what the purpose of this paper is. All sections need to be heavily reworked to make it clearer what the contribution of this paper is and whether it is a scientific and sound contribution that warrants publication. My impression is that the content is neither scientific nor sound. However, I might err because I do not understand the paper. I propose to reject the paper, but if the authors are convinced of the scientific benefit of their work, a completely rewritten version could be submitted again.

We appreciate your honesty and understand that the text was not clearly presented, which made the paper's contribution difficult to understand. We edited it and added a clearly formatted contribution statement: the design of a multi-purpose dataset structure and an estimation of content block perception variety. It is probably not a very significant scientific contribution. However, it is oriented towards applied sciences and engineering, so the obtained results can be used to test web page content segmentation methods in the future.

 

Abstract: What is „xssss“?

Sorry for the error. It was corrected.

 

Section 1: I do not understand the introduction. It appears to me that the most important sentence is the last one from line 53. Maybe it would be good to start the introduction with this sentence and then explain more what the problem is and how you help to solve this problem.

We left the aim at the end of the section. However, we modified the introduction to highlight the problem and the need for a dataset.

 

To improve understanding of the story of the paper, you could add bullet points with your concrete contributions and provide a “this paper is structured as follows” paragraph.

A contribution statement was added.

 

Section 2 (related work) does not really seem to be a related work section. Instead, it seems to be a description of the actual problem. Please separate the description of the problem you address (this is essential to understanding your paper) from the related work (that is only necessary to understand whether your work is novel).

As we propose a dataset structure, the related works section also focuses on existing web page datasets that store block labeling information. Additionally, it analyses existing web content block segmentation papers to understand what methods are used and what data they are using for the segmentation. Therefore we believe the related works are related to the paper, but some additional text was added to explain its relation and importance for the research.

 

You state (l.58): “That structure is mainly used for layout and design of the website, rather than to identify what information is being presented in different blocks.” Is that so? Please provide evidence.

The sentence was rewritten, and a reference was added to support the point that most of a web page’s HTML tags are used to form the layout/design and do not, or do not fully, describe the block’s content. We can also provide some statistics, but since these are related to our research, we did not include them in that section of the paper. The six websites used during our experiment consisted of 2838 tags. Of these, 1975 tags can be counted as layout tags, including <div>, <section>, <span>, <i> and others. The remaining 863 tags described content. The content tag group consisted of tags such as <a>, <p>, <img>, headings (h1-h6) and others.
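A count of this kind could be reproduced with a short script. The following is only a minimal sketch, not the authors' tooling: the two tag sets contain just the examples named in the response (the full lists behind "and others" are not given), and the names `LAYOUT_TAGS`, `CONTENT_TAGS`, `TagCounter` and `count_tag_groups` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed classification, built only from the examples named in the response;
# the authors' complete lists ("and others") are not known.
LAYOUT_TAGS = {"div", "section", "span", "i"}
CONTENT_TAGS = {"a", "p", "img", "h1", "h2", "h3", "h4", "h5", "h6"}


class TagCounter(HTMLParser):
    """Counts opening tags by (lower-cased) tag name."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <img/>.
        self.counts[tag] += 1


def count_tag_groups(html):
    """Return how many tags fall into the layout, content and other groups."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    layout = sum(n for t, n in parser.counts.items() if t in LAYOUT_TAGS)
    content = sum(n for t, n in parser.counts.items() if t in CONTENT_TAGS)
    other = sum(parser.counts.values()) - layout - content
    return {"layout": layout, "content": content, "other": other}


sample = (
    "<div><section><h1>Title</h1>"
    '<p>Text <a href="#">link</a></p><img src="x.png"></section></div>'
)
print(count_tag_groups(sample))
```

For the figures reported in the response, 1975 of 2838 tags is about 69.6% layout tags and 863 of 2838 is about 30.4% content tags.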

 

Section 3: Do not describe your research in chronological order. Instead, describe your solution first. After that provide details and why you made certain design decisions.

We took your recommendation into account and changed the order of Section 3.

 

Section 4: I do not understand what needs to be validated and I do not understand whether the validation is suited to the problem. Why

We changed the title of Section 4 to reflect this better. As we propose a structure for a dataset that stores web page block labeling information, it is difficult to evaluate how good it is. Therefore, the term 'validation' might be too strong, and we changed it to the term 'research'.

 

Language/minor issues:

Section 2 has no introductory sentence.

An additional paragraph was added at the beginning of Section 2

 

English language editing necessary.

Please try to improve grammar and readability of your paper by splitting up those countless very long sentences. Just one example; this applies analogously to all parts of the paper: l. 23: The landscape of information systems is evolving, amount of digital information increases daily and so does the need to collect, process and then access stored information. -> I propose to split the sentence into three separate ones to improve grammar and make it more easily readable.

Text editing was done. We hope it improved the text quality

Author Response File: Author Response.docx

Reviewer 2 Report

The paper presents a new dataset composed of web pages to analyze data extractors.

I do not understand what ‘xssssof’ in the abstract means, is it a typo? I also do not understand the meaning of ‘would include as much different data points as possible.’ Why ‘would’? Is the new dataset not defined? Is it evolving? How many data points should be included?

I think Table 1, which is one of the major contributions of the paper, must be better presented and explained. For instance, clean text is not explained. Also, if screenshots are not available, could it be solved by rendering the web pages using different web browsers and capturing the screenshots? Do screenshots have certain requirements that must be fulfilled? For the cells marked as ‘some entries,’ I think it is important to have a measurement: 10% of the entries? 90%? More explanations about the differences among HTML, DOM, CSS and JavaScript would be interesting.

In Section 3, the problem of having different blocks, even overlapping blocks, is introduced but not explained in detail. I think this is an important discussion, and Figure 2 does not help much in understanding the main points.

Instead of providing the relational model of the database, the authors should provide a higher-level model like ER or UML. 

The experiment presented in Section 4 is not very convincing to me. I did not find how many respondents replied, but this seems like a crowdsourcing task and will require more time. I think it is more interesting to know how easy/challenging the proposed dataset is. I think the authors should use the dataset to compare different extraction techniques, which is their main motivation at the beginning of the paper. I agree that the results will depend on how the dataset is labeled (blocks, etc.), but I think it is reasonable to assume the authors are experts and have provided appropriate labels for the most part.

Rafael Corchuelo, who I believe is not cited in the paper, has a lot of work on data extraction from web pages and fair comparisons among data extractors using statistical methods. I think the authors should take a look at some of his papers, as they are very relevant.


Typos: ‘validate it’s design’ -> ‘validate its design’
Other comments: + Section 3, Paragraph 1 should be entirely rewritten.

Author Response

The paper presents a new dataset composed of web pages to analyze data extractors.

I do not understand what ‘xssssof’ in the abstract means, is it a typo? I also do not understand the meaning of ‘would include as much different data points as possible.’ Why ‘would’?

Is the new dataset not defined? Is it evolving? How many data points should be included?

The text editing was done. We hope the quality has improved.

 

I think Table 1, which is one of the major contributions of the paper, must be better presented and explained. For instance, clean text is not explained.

The description was extended and presented after the table

 

Also, if screenshots are not available, could it be solved by rendering the web pages using different web browsers and capturing the screenshots? Do screenshots have certain requirements that must be fulfilled?

The explanation for screenshot storage rather than their generation was presented. Screen resolutions are presented as well. It is possible to produce screenshots in cases when datasets store all the page assets, but based on our research, such data is not commonly stored. Furthermore, generating screenshots would require additional time to build the appropriate solution and generate the screenshots.

 

For the cells marked as ‘some entries,’ I think it is important to have a measurement: 10% of the entries? 90%?

In the table’s footnote, some explanations were added

 

More explanations about the differences among HTML, DOM, CSS and Javascript would be interesting.

Some text was added to explain how usage of those data types varies

 

In Section 3, the problem of having different blocks, even overlapping blocks, is introduced but not explained in detail. I think this is an important discussion, and Figure 2 does not help much in understanding the main points.

Additional explanation was added, both in the text as well as in the figure caption

 

Instead of providing the relational model of the database, the authors should provide a higher-level model like ER or UML.

Thank you for the recommendation. We used simplified class diagram notation and believe that now it will be easier to understand the dataset data structure

 

The experiment presented in Section 4 is not very convincing to me.

I did not find how many respondents replied, but this seems like a crowdsourcing task and will require more time.

Text was added to address this; there were 6 respondents.

 

I think it is more interesting to know how easy/challenging the proposed dataset is. I think the authors should use the dataset to compare different extraction techniques, which is their main motivation at the beginning of the paper.

The text was added before Figure 3 to define the reason for the use of the relational database. It will allow dataset presentation in different formats and structures.

 

Rafael Corchuelo, who I believe is not cited in the paper, has a lot of work on data extraction from web pages and fair comparisons among data extractors using statistical methods. I think the authors should take a look at some of his papers, as they are very relevant.

Rafael Corchuelo's research is oriented towards different aspects of web page extraction but has some similarities with ours. We added three new references, two of which are papers by Rafael Corchuelo.

 

Typos: ‘validate it’s design’ -> ‘validate its design’

Corrected

 

Other comments: + Section 3, Paragraph 1 should be entirely rewritten.

The paragraph was rewritten

Author Response File: Author Response.docx

Reviewer 3 Report

The authors state that the available datasets used for evaluation of the methods of data extraction from the web pages are specific for different approaches, because different methods use different data. They claim that the universal datasets are needed to compare the different approaches, and propose a wide set of content blocks to obtain such datasets from the web pages.

The paper presents the authors' contribution in a rather understandable way (however, some language improvement is needed, and some formulations are not so good - such as "to perform benchmarking of their performance" or "a hybrid of the above, which includes two or more of the above approaches").

Two points should be presented more clearly, in my opinion:

  1. How do the authors deal with blocks that were labelled differently by the respondents? Were only those blocks that were named similarly chosen for the datasets?
  2. (probably more important:) If different data extraction methods use different kinds of data, how can combining those different kinds in one dataset help in comparing them? In other words, if some approaches use e.g. HTML structure and others CSS or images, aren't they incomparable? It would be nice to demonstrate a comparison of such different approaches using the proposed multi-purpose dataset, or at least to discuss how it can be done.

Author Response

The authors state that the available datasets used for evaluation of the methods of data extraction from the web pages are specific for different approaches, because different methods use different data. They claim that the universal datasets are needed to compare the different approaches, and propose a wide set of content blocks to obtain such datasets from the web pages.

The paper presents the authors' contribution in a rather understandable way (however, some language improvement is needed, and some formulations are not so good - such as "to perform benchmarking of their performance" or "a hybrid of the above, which includes two or more of the above approaches").

The paper text was revised and rewritten. We hope there are no longer any complex, long sentences.

 

Two points should be presented more clearly, in my opinion:

How do the authors deal with blocks that were labelled differently by the respondents? Were only those blocks that were named similarly chosen for the datasets?

Additional text was added to explain that research was done to validate the proposed dataset structure and get some insights on directions for its further development.

 

(probably more important:) If different data extraction methods use different kinds of data, how can combining those different kinds in one dataset help in comparing them? In other words, if some approaches use e.g. HTML structure and others CSS or images, aren't they incomparable? It would be nice to demonstrate a comparison of such different approaches using the proposed multi-purpose dataset, or at least to discuss how it can be done.

Text was added to the paper to explain this better. The point of comparison is to see which methods perform better, are faster or more accurate, need fewer resources, etc. To do this, they have to be tested using the same data source. The proposed dataset provides different types of data points and can thus be used even when comparing otherwise incompatible data extraction methods. As long as the same web pages are analyzed, the classification results will be comparable even if the methods are completely different. Without dataset universality, it would not be possible to compare different types of methods.
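The argument can be illustrated with a small sketch, assuming hypothetical record fields and toy methods rather than anything taken from the paper: two methods read different data points of the same page record, yet both are scored against the same manually assigned block labels. `PageRecord`, `html_method`, `css_method` and `benchmark` are illustrative names, and the F1 score over label sets is an assumed metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """One web page with several data points (hypothetical fields)."""
    url: str
    html: str
    css: str
    labels: frozenset  # manually assigned block labels (ground truth)


def f1(predicted, expected):
    """F1 score between a predicted and an expected set of block labels."""
    if not predicted and not expected:
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def html_method(page):
    """Toy method that reads only the HTML data point."""
    found = set()
    if "<nav" in page.html:
        found.add("navigation")
    if "<footer" in page.html:
        found.add("footer")
    return found


def css_method(page):
    """Toy method that reads only the CSS data point."""
    found = set()
    if ".hero" in page.css:
        found.add("hero")
    if ".footer" in page.css:
        found.add("footer")
    return found


def benchmark(methods, pages):
    """Average F1 per method over the same pages."""
    return {
        name: sum(f1(method(p), p.labels) for p in pages) / len(pages)
        for name, method in methods.items()
    }


PAGES = [
    PageRecord(
        url="https://example.com",
        html="<nav></nav><div class='hero'></div><footer></footer>",
        css=".hero { height: 80vh; }",
        labels=frozenset({"navigation", "hero", "footer"}),
    )
]
METHODS = {"html_only": html_method, "css_only": css_method}

print(benchmark(METHODS, PAGES))
```

Because every method receives the whole record and is scored against the same labels, the scores are directly comparable even though the methods never read the same input field.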

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Thanks for the clarifications and rework of your paper. I think I now better understand what the paper is about. However, I still have difficulty seeing the (scientific) contribution the paper makes.

First contribution: You state that your first contribution would be that you provide a data structure. I assume you are referring to Figure 2, where you are presenting your data structure. Is it a database schema, where you store your set of pages? I do not see that there is anything non-trivial in this figure. If you think there is, then please explain why your data structure is a research contribution. What can others learn from it?

I understand from your related work that researchers researching content analysis algorithms need some kind of benchmark toolkit. So, a valid contribution might be that you collect a large set of such web pages to build a standard benchmarking kit. But this is not what you did.

Please describe in more detail what your data set looks like and what it contains.

To make your data set valuable to others, you probably need to publish it as well, so that others can use it to evaluate their algorithms; e.g., as an appendix to the paper or archive it permanently on some web address.

Or do you provide a method to automatically synthesize huge amounts of web pages for analysis? Then you have to show that your generated web pages are representative of actual web pages.

Second contribution: Content block perception variety estimation. What is the purpose of this experiment? Please state your research question first. And then describe your research setup. How many people? Relevant background? How were they to do their rating? You could structure your experiment according to the IMRaD methodology. Google for Springer and IMRaD. They have a nice explanation. Next, present the data you collected. And then, discuss it. What is the benefit of the experiment? Why are the “collected variations” so important? What is the “content block variation problem” and why would one need an “initial variety scale”?

You claim that you “propose a dataset that would provide various data points that can be used to benchmark or develop different types of algorithms and easily compare their performance.” I do not see how your experiment contributes to this claim.

Line 126: The word “Hero” might be wrong. Maybe “Title” or “Logo” or something like this would be a better word?

Author Response

-Thanks for the clarifications and rework of your paper. I think I now better understand what the paper is about. However, I still have difficulty seeing the (scientific) contribution the paper makes.

-First contribution: You state that your first contribution would be that you provide a data structure. I assume you are referring to Figure 2, where you are presenting your data structure. Is it a database schema, where you store your set of pages? I do not see that there is anything non-trivial in this figure. If you think there is, then please explain why your data structure is a research contribution. What can others learn from it?

We added some additional text at the end of the introduction to highlight the scientific uncertainty.
We agree that the paper's scientific novelty is not very global: it does not provide methods suitable for different areas, etc. However, we disagree that it is trivial and has no scientific contribution. It is crucial to take the variations of a content block into account, as a block has no strict bounds. The dataset would be inflexible if it did not take variations into account and present a solution for handling them.
Taking into account what kind of scientific novelty is presented in the journal Applied Sciences, we believe the proposed dataset structure resolves more uncertainties than a mobile application for rhythmic dictation (https://doi.org/10.3390/app10196781), game engine usage for visuo-haptic learning (https://doi.org/10.3390/app10134553), etc.

 

 

-I understand from your related work that researchers researching content analysis algorithms need some kind of benchmark toolkit. So, a valid contribution might be that you collect a large set of such web pages to build a standard benchmarking kit. But this is not what you did.

The development of a publicly available multi-purpose dataset of web pages and their content blocks is our final goal. However, in order to create one, some steps have to be taken: (1) a suitable data structure must be designed for it, (2) the dataset must be filled with a large amount of varied data and (3) a user interface should be designed to present the data stored in the dataset and to fill the dataset with new data. At the moment, we feel we’ve finished the first step, but the other two are not fully implemented yet. The dataset will be available online after completing the remaining steps.

 

 

-Please describe in more detail what your data set looks like and what it contains.

The description of the dataset and the stored data is presented in Section 3. As we propose a dataset structure, we concentrate on what type of data should be stored in the dataset and how. Meanwhile, the dataset's initial data records are presented in Section 4.1, where we present which websites were selected for validating the proposed dataset and estimating its suitability.
In the first version of the paper, we presented the physical structure of the relational database. After we received recommendations from the reviewers, the database schema was changed into a more abstract form to improve its readability.

 

 

-To make your data set valuable to others, you probably need to publish it as well, so that others can use it to evaluate their algorithms; e.g., as an appendix to the paper or archive it permanently on some web address.

We will do it when the dataset is filled with a more significant number of records. As you mentioned, its biggest application value is the data. Therefore, we are not sure if it makes sense to present the detailed database structure in the appendix if the current set of data is to be extended in the future.

 

 

-Or do you provide a method to automatically synthesize huge amounts of web pages for analysis? Then you have to show that your generated web pages are representative of actual web pages.

No, the web pages are labeled manually at the moment. Synthetic generation might not represent the real variety of web pages. Because of the manual labeling, the task is not easy and requires time and accuracy.

 

 

-Second contribution: Content block perception variety estimation. What is the purpose of this experiment? Please state your research question first. And then describe your research setup. How many people? Relevant background? How were they to do their rating? You could structure your experiment according to the IMRaD methodology. Google for Springer and IMRaD. They have a nice explanation. Next, present the data you collected. And then, discuss it. What is the benefit of the experiment? Why are the “collected variations” so important? What is the “content block variation problem” and why would one need an “initial variety scale”?

The initial aim was to design the dataset structure. However, during the dataset validation experiments, we obtained additional results that might be relevant to other research. Therefore, we present them as a contribution.
We apply the IMRaD methodology in our research. In the introduction, we present the importance and the problem of the topic. In Section 3, we present the proposed dataset structure and explain its design decisions. To obtain adequate results, we conducted an experiment. As it is an additional experiment, we divide Section 4 into subsections that present the methodology of the experiment and its results. The discussion is presented in the conclusion as well as in the results section.
Because the paper has a two-layered IMRaD structure, the introduction of the inner structure is a consequence of the first part.
As we mentioned earlier, the content block perception variety estimation was a consequential result, which we obtained from implementing the first contribution.

 

 

-You claim that you “propose a dataset that would provide various data points that can be used to benchmark or develop different types of algorithms and easily compare their performance.” I do not see how your experiment contributes to this claim.

An additional subsection (4.1) was added to reflect the dataset structure's coverage of all analyzed web page segmentation solutions and to highlight the additional features, which can be used in future solutions.

 

 

-Line 126: The word “Hero” might be wrong. Maybe “Title” or “Logo” or something like this would be a better word?

“Hero block” is a valid term broadly used in web development to describe the first introductory block on a website. It most commonly contains images, text, links or forms. For this reason, we would like to keep the block name as is.

Author Response File: Author Response.docx

Reviewer 2 Report

I would like to thank the authors for significantly improving the paper with respect to the previous draft on very short notice.

I still have a couple of concerns. First, since I am assuming the dataset will be publicly available, I would like to judge the dataset before accepting the current paper. I think the paper without the dataset has no value.

Second, I still believe that the study with only six participants is weak, especially since these blocks can be annotated in multiple ways in the HTML code. I think it is more interesting to see the comparison of, at least one but more would be desirable, existing techniques in the proposed dataset. How these techniques agree/disagree with manual labeling and whether the techniques help discern discrepancies between respondents. If the authors aim to use this dataset as a standard benchmark, this is just a necessary step.

I am marking this as a minor revision since I believe some of these techniques are publicly available and the authors should be able to perform the requested comparisons. If none of these techniques are really available, the authors should indicate that this is indeed a shortcoming and how to tackle it.

Author Response

-I would like to thank the authors for significantly improving the paper with respect to the previous draft on very short notice.

-I still have a couple of concerns. First, since I am assuming the dataset will be publicly available, I would like to judge the dataset before accepting the current paper. I think the paper without the dataset has no value.

We present the dataset structure, not a dataset with a large number of records. We added some explanation that the dataset is a relational database. In the first version of the paper, we presented the dataset's physical structure; however, after the reviewers' comments, it was replaced by a more abstract form.
If you want, we can add the SQL export file of the database structure as supplementary material; however, we do not feel it is needed. The result is the dataset structure presented in the paper, together with the experiment results, which highlight the need for variations in the dataset.
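To make the idea of a relational dataset that keeps label variations concrete, here is a minimal sketch using SQLite. The schema is purely illustrative and is not the authors' database structure: the table names `page`, `block` and `block_label`, their columns, and the sample rows are all assumptions.

```python
import sqlite3

# Hypothetical schema: one block can carry a different label per respondent,
# so variations in how people perceive a block are stored, not discarded.
SCHEMA = """
CREATE TABLE page (
    id   INTEGER PRIMARY KEY,
    url  TEXT NOT NULL,
    html TEXT
);
CREATE TABLE block (
    id       INTEGER PRIMARY KEY,
    page_id  INTEGER NOT NULL REFERENCES page(id),
    selector TEXT NOT NULL
);
CREATE TABLE block_label (
    block_id   INTEGER NOT NULL REFERENCES block(id),
    respondent TEXT NOT NULL,
    label      TEXT NOT NULL,
    PRIMARY KEY (block_id, respondent)
);
"""


def build_demo():
    """Create an in-memory database with one page, one block, three labels."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO page (id, url, html) VALUES (?, ?, ?)",
        (1, "https://example.com", "<div id='top'></div>"),
    )
    conn.execute(
        "INSERT INTO block (id, page_id, selector) VALUES (?, ?, ?)",
        (1, 1, "#top"),
    )
    conn.executemany(
        "INSERT INTO block_label (block_id, respondent, label) VALUES (?, ?, ?)",
        [(1, "r1", "hero"), (1, "r2", "hero"), (1, "r3", "header")],
    )
    return conn


def label_variations(conn, block_id):
    """Return (label, votes) pairs for a block, most frequent label first."""
    return conn.execute(
        "SELECT label, COUNT(*) FROM block_label "
        "WHERE block_id = ? GROUP BY label "
        "ORDER BY COUNT(*) DESC, label",
        (block_id,),
    ).fetchall()


print(label_variations(build_demo(), 1))
```

Storing one row per respondent and block, rather than a single label per block, is one way such a structure could preserve disagreement between respondents for later analysis.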

 

 

-Second, I still believe that the study with only six participants is weak, especially since these blocks can be annotated in multiple ways in the HTML code. I think it is more interesting to see the comparison of, at least one but more would be desirable, existing techniques in the proposed dataset. How these techniques agree/disagree with manual labeling and whether the techniques help discern discrepancies between respondents. If the authors aim to use this dataset as a standard benchmark, this is just a necessary step.

An additional subsection (4.1) was added to reflect the dataset structure's coverage of all analyzed web page segmentation solutions and to highlight the additional features, which can be used in future solutions.

 

 

-I am marking this as a minor revision since I believe some of these techniques are publicly available and the authors should be able to perform the requested comparisons. If none of these techniques are really available, the authors should indicate that this is indeed a shortcoming and how to tackle it.

Author Response File: Author Response.docx
