Article
Peer-Review Record

Toward the Implementation of Text-Based Web Page Classification and Filtering Solution for Low-Resource Home Routers Using a Machine Learning Approach

Electronics 2025, 14(16), 3280; https://doi.org/10.3390/electronics14163280
by Audronė Janavičiūtė, Agnius Liutkevičius * and Nerijus Morkevičius
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 8 July 2025 / Revised: 16 August 2025 / Accepted: 18 August 2025 / Published: 18 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

It is very important to restrict and filter harmful content on the Internet. The paper studies the use of machine learning algorithms to filter unwanted web content in real time in a resource-limited end-user environment, which has great practical application value. Experiments have shown that the linear support vector machine algorithm implemented in C/C++ has the highest classification accuracy and is also relatively fast. The following questions are provided to help the authors improve the paper:
1) The features of the dataset used in the paper should be fully described, for example, how much harmful web content needs to be restricted and filtered?
2) The content that needs to be restricted is only a small fraction of all content, which requires addressing class imbalance issues, but the paper does not elaborate on this aspect.
3) The paper should propose improved algorithms for web filtering, especially based on the latest research results, and compare them with the latest SOTA algorithms.
4) Too few of the cited references are recent journal and conference papers on web page content filtering and restriction.
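For context, a minimal sketch of the kind of linear-SVM text-classification pipeline the review refers to, assuming a scikit-learn style training workflow; the paper's on-router inference is implemented in C/C++, and all data and parameter choices below are illustrative rather than the authors' code.

```python
# Sketch only: TF-IDF features + linear SVM for web-page text categories.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: extracted page texts and their category labels.
page_texts = ["poker bonus and casino games", "latest football scores and league tables"]
labels = ["gambling", "sport"]

clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english", max_features=20000),
    LinearSVC(C=1.0),
)
clf.fit(page_texts, labels)
print(clf.predict(["football league results"]))  # classify a new page's text
```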

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Practical Application and Evaluation of Text-Based Web Page Filtering for Home Routers Using Machine Learning Approach

 

The work is interesting, but the title is somewhat misleading. While semantically correct, I think it would be beneficial to review the title and phrase it as a step towards an implementation rather than leading a reader to expect a full implementation from the title and abstract.

The work undertaken is sound, but more detail needs to be provided, along with a clearer specification of the limits and scope of the work. The authors need to be very clear about their focus and what is actually covered.

The TL;DR version of the work is: a text parser was built and cross-compiled to run under OpenWRT on an embedded system. This in itself has some novelty, but is not what the paper advertises.

A solid description of the work undertaken sets up the authors for follow-up work looking at a more complete system (or possibly rescaling the system to not do deep inspection, but just look at classification of the domain/URL – this would be particularly relevant in malware/phishing defence).

 

A delay of ~2 seconds to access a page is notable. It would be useful to quote/evaluate some usability metrics  for web applications in terms of response time.

Unclear why the focus is on pornography when something like malware/phishing would be a much more appropriate and generic issue to look at. The rationale behind the focus needs to be discussed, and this should be based on more than just [3].

The discussion around filtering placement ignores the ‘elephant’ of encrypted communication, not only HTTPS, but also the increasing use of DoH or DoT secure DNS.

Strongly suggest that much of the background is reviewed and the focus is placed on generic filtering (Section 1.1 – alternatively, there needs to be a much stronger case for why adult content is chosen). Related work (Section 1.3) would be better placed in Section 2, as is typical, with the current Section 2 becoming Section 3. The authors are encouraged to review published journal papers to see what is typical.

 

The testing lacks a key element: how the lists were tested on the router. What is inferred is that the list was cycled through all the entries in the test set and then evaluated. How many times was this done? Were different draws done?

What does not seem to have been done is actually implementing any real functionality to pick up/proxy/intercept requests from clients.

The performance appears to be for single-threaded requests. How would this scale if there were multiple concurrent clients?

 

Given the constraints, the approach may well be more appropriate for dealing with just URLs/domain names rather than actual page text.
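A minimal sketch of what such a URL/domain-only classifier could look like, assuming character n-gram features over the raw URL string instead of fetched page text; the names and data are illustrative and not taken from the paper.

```python
# Sketch only: classify the requested URL/domain string itself.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

urls = ["example-casino-games.com/slots", "news.example.org/sport/results"]
labels = ["gambling", "sport"]

url_clf = make_pipeline(
    # Character n-grams pick up tokens embedded in domain names and paths.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LinearSVC(),
)
url_clf.fit(urls, labels)
print(url_clf.predict(["bet-casino.example.net/promo"]))
```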

 

Looking forward, a couple of issues need to be considered:

How will HTTPS content be dealt with?

How will actual interception work?

How will DoT/DoH be dealt with?

 

The authors are thanked for making the data available, although ~2700 URLs is fairly limited content, given that 115 (~4%) are tagged as adult. Similarly, several entries are not just the site name, but are either full URLs or partial paths.

They are encouraged to publish example code on a repository such as GitHub.

 

Line 19 – RAM is common enough and clear from context; it likely does not need to be spelled out.

Line 20 – as above, for CPU.

Line 28 – this is old data given we are halfway through 2025! Can this be updated?

Line 33 – for context, state the year of the study.

Line 86 – the ‘power’ of the servers is very much dependent on the approach taken. Content inspection is expensive, but DNS or routing blackholes are very lightweight. Justification must be provided.

Line 98 – a more detailed explanation of traffic interception is needed.

Line 112 – here is an example of where there is a wealth of published work on URL filtering (largely focused on malware/phishing, but still relevant).

Line 114 – a better reference is needed here than what [13] currently is.

Table 1 – additional hardware may well be needed at the ISP level depending on the approach taken. Similarly, the rankings of difficulty are somewhat subjective, dependent on method, approach, etc. The table is somewhat naive in its assumptions.

Line 127 – scalability is an issue here, especially as larger numbers of users are present. A home may have <6 users, while at the other end one is potentially looking at hundreds.

NB: make clear mention that the dataset is available, and state its size.

Line 229 – be clear about what was used in training – the URL or the page text. In cases where only sites are present, this could result in a lot of changing text. How was this felt to be accurate?

Line 231 – what was this split based on, and how was it ensured that this was balanced?
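For illustration, a stratified split is one common way to make such a division balanced by construction; the sketch below uses toy data in place of the actual dataset and is only an assumption about how the split could be done.

```python
# Sketch only: an 80/20 split that preserves per-class proportions.
from sklearn.model_selection import train_test_split

page_texts = [f"sports article {i}" for i in range(5)] + [f"casino promo {i}" for i in range(5)]
labels = ["sport"] * 5 + ["gambling"] * 5

X_train, X_test, y_train, y_test = train_test_split(
    page_texts,
    labels,
    test_size=0.2,      # 80/20 split
    stratify=labels,    # keep class proportions identical in train and test
    random_state=42,    # fixed seed so the draw is repeatable
)
```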

Line 243 – name the algorithm, and then cite. Justify how this was selected.

Line 256 – text should ideally flow around figures and tables. Avoid blocks of whitespace like this.

Line 257 – consider the duplication between the text and the diagram. Are both really needed? Could Figure 2 be adjusted to demonstrate the flow?

Lines 262-264 – this is somewhat repetitive; review and clarify based on earlier statements. See line 238.

Line 267 – mention where the code is available? Issues with cross-compilation: “installed on the router” glosses over a range of issues. What was the router running, how was the additional code loaded, was this running stock firmware or OpenWRT/Tomato? (Answered on line 340, but it should be disclosed earlier, along with a more detailed discussion.)

 

Lines 275-290 – there is repetition here that can be removed. Minimise redundancy. Authors need to avoid glossing over details such as how the cleaning was done. It would be beneficial to link to where scripts/preprocessing and the router software are available.

Line 291 – excessive white space follows.

Line 293 – the caption should not be split from the image.

Line 294 – “The software timers”, but these have not been introduced yet. Suggest just “software timers” without the definite article.

Line 305 – justify this adjustment to the data.

Line 317 – comment on memory utilisation, which is typically constrained, given the need for a Python VM?

Line 327 – the section number is sufficient; the name is not needed and should be omitted.

Line 336 – how was word removal dealt with?

Line 339 – what is the relevance of the Windows SDK? Or is this for building the training side of the toolchain? Be clear and specific; do not rely on readers inferring. What compiler was used for the various components?

 

Line 342 – was OpenWRT compiled from source, or was the source of the tools compiled using it? (The latter is assumed, but the text is ambiguous.)

 

Line 348 – as expected, the router is resource-constrained, especially in RAM. Commentary on how much RAM was free post-boot and in normal operation before the system was loaded would be insightful. Would it be worth having a footnote to the product page on the ASUS website?

Line 369 – do not split tables across pages.

Line 373 – 0.11s would not be noticeable, but 1.8 seconds would be, especially if incurred on every page.

Line 392 – there are fairly large and well-maintained lists available for blocking such content, often running into the thousands of sites. This suggests more research is needed. Many commercial adult sites (porn/gambling) are very well-established web properties – see lists like the Majestic Million.

 

Line 412 – there needs to be some further discussion of the shortcomings of the testing, as well as the impact the testing had on normal router operation – e.g. was latency increased in transit traffic and in ping responses on the LAN?

 

Line 480 – this is repetition; the specs have been stated earlier.

 

Line 486 – this is a bold statement to make given the lack of ancillary testing on the impact, or a complete solution. How large was the impact of the data file and code, again given the constraints of the embedded storage?

 

References

These are in poor shape. There is a notable lack of appropriate comparative work. While pornography is one major issue, it is surprising that there are no links to any phishing/malware distribution works, which are a much broader and more generic problem that merits blocking everywhere.

 

1,2 – need institutional author

3 – missing author

6 – the author is missing; this is FCC/US gov.

7 – URL – be consistent with others.

8 – author needed; cite as a white paper/technical report.

9 – author needed; cite the specific document, not the URL listing all treaties. Be specific, e.g. https://rm.coe.int/1680084822

10 – institutional author needed

13 – a better quality reference can be found for this. Author needed

16, 19, 20 – DOIs as in the other publications?

23 – where can this be found / how was it published? Link back to the details provided on line 504.

24 – author needed; fix the URL hyperlink. The author name is listed on the GitHub page.

25 – “18443182526 Bytes.”??? How/where was this published?

27-31 – author needed; could this not just be a footnote or inline URL, as it points to a landing page rather than a specific piece of information?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper utilizes machine learning for a home router filtering system. However, in order to be considered for publication, the authors must improve the overall quality of the paper and address the following concerns:
1) The authors need to strengthen the contributions of the paper. The current level of contribution is insufficient for publication. In addition, the authors should describe the main contributions in greater detail and with improved clarity.
2) The authors should provide a clearer explanation of the proposed system architecture. The current description is insufficient for understanding how the system is structured and operates.
3) The authors should obtain and present additional results to offer deeper insights into the effectiveness of the proposed system.
4) The authors need to clearly explain the architecture of the machine learning component used in the filtering system.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Even when filtering or blocking entire web pages, there is still an imbalance issue that needs to be addressed, as the web pages that need to be filtered or blocked are a minority. Therefore, the method proposed in the paper must include how to solve the imbalance problem. This is exactly where the paper needs innovation.

 

Author Response

1. Summary

Thank you very much for taking the time to review this manuscript and for your valuable comments that allowed us to improve the article. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the resubmitted files.

2. Point-by-point response to Comments and Suggestions for Authors

Comments 1: Even when filtering or blocking the entire webpage, there is still an imbalance issue that needs to be addressed, as these webpages that need to be filtered or blocked are still a minority. After all, the web pages that need to be filtered are a minority. Therefore, the method proposed in the paper must include how to solve imbalance problems. This is exactly where the paper needs innovation.

Response 1: Thank you for pointing this out. That is right, the imbalance issue is an important one, but as is explained in the updated Introduction and Model of the Proposed System sections, in our case there are no fixed “bad” or “good” categories. For this reason, we made a new dataset with almost the same number of records per category, except for the adult category. Parents or network administrators may want to filter any kind of content, including business, sport, etc. For example, parents may want to block access to any e-shop to avoid unsupervised purchases with their credit card. However, the main focus of this study was to find out whether a home router is capable of running ML-based algorithms with sufficient speed and accuracy. Although some classes, such as adult, have fewer records, the results of the study show the high potential of the proposed approach with sufficient accuracy and processing time. As stated in the updated Discussion section, future work should cover imbalance issues to further improve the performance of the proposed system.
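As an illustration of the mitigation deferred to future work here, class weighting is one standard way to counter such imbalance when training a linear SVM; the sketch below assumes a scikit-learn style workflow and toy data, and is not the authors' implementation.

```python
# Sketch only: weight classes inversely to their frequency during training.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "family dinner recipes",
    "weather forecast for the weekend",
    "football league results",
    "explicit adult content sample",   # deliberately the minority class
]
labels = ["other", "other", "other", "adult"]

weighted_clf = make_pipeline(
    TfidfVectorizer(),
    LinearSVC(class_weight="balanced"),  # errors on rare classes cost more
)
weighted_clf.fit(texts, labels)
```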

Reviewer 2 Report

Comments and Suggestions for Authors

Toward the Implementation of Text-Based Web Page Classification and Filtering Solution for Low-Resource Home Routers  Using a Machine Learning Approach

Review 2

 

The authors are thanked for their detailed and engaged response. The paper is noticeably improved.

The title is much more appropriate and better reflects the work described.

The paper still has an overuse of references focusing on adult content sites, when in the response the authors note that this is actually a minor part. The introduction in particular needs to be reworked, along with other mentions. As the authors state, this is not the priority, it is just one category; if this is the case, there needs to be further discussion of other threats. The introduction remains much longer than is typical, and some editing needs to be considered.

 

The discussion around delay is improved, but there needs to be some further evaluation of this as part of the prototype/shortcomings of the work. There is no detail as to how non-HTML/text URLs would be skipped, cached, etc.; in this sense the work is fairly limited (although successful) and these points need further discussion.
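One possible way to skip non-HTML/text content, sketched here purely as an assumption about how a more complete prototype might behave (it is not the authors' implementation), is to check the Content-Type header before handing a page to the classifier.

```python
# Sketch only: classify a response only when it looks like an HTML/text page.
from urllib.request import urlopen

def should_classify(url: str) -> bool:
    """Return True only for responses whose Content-Type is HTML or plain text."""
    with urlopen(url, timeout=5) as resp:
        content_type = resp.headers.get("Content-Type", "")
    return content_type.startswith(("text/html", "text/plain"))
```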

 

Regarding dealing with HTTPS, there is mention of domain filtering, which is effectively out of scope of what this system is trying to prove possible, but certainly part of a more complete solution. Domain extraction is not always straightforward – e.g. SSL proxies are just one bypass means. DoH/DoT will require appropriate firewall configuration, although this again becomes a challenge of locking all ports down, and it is very difficult to prevent DoH running on 443/tcp. There is the discussion of this being used as a parental control; bypassing the suggested approaches is not likely to be difficult for many teenagers 😉

 

 

 

The authors are thanked for the code availability.

Specific comments:

Lines 36-59 – review carefully to de-emphasise adult content and include other common threat areas and undesirable content.

Line 80 – the citations simply link to product pages; how were these determined to be the most popular? Consider rewording as “popular applications include...” (and also consider URLs as footnotes or inline?)

Line 336 – can the quality of Figure 3 be improved?

Line 572 – typically there is no space before the “s” unit. Be consistent; on the following page, seconds is written out in full.

 

References

[9-13] could these not be better placed as footnotes to the products when mentioned?

[22] make the DOI consistent with the other cases.

[31] a DOI is available if you export the citation as BibTeX; this is significantly shorter – 10.6084/m9.figshare.19406693.v5

[33] this just points to a homepage, could be done as a footnote.

The references are otherwise improved.

Author Response

1. Summary

Thank you very much for taking the time to review this manuscript and for your valuable comments that allowed us to improve the article. We are very grateful for your insights and very detailed comments and suggestions, which were all addressed while making revisions. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the resubmitted files.

2. Point-by-point response to Comments and Suggestions for Authors

Comments 1: The paper still has an overuse of references focusing on adult content sites, when in the response the authors note that this is actually a minor part. The introduction in particular needs to be reworked, along with other mentions. As the authors state, this is not the priority, it is just one category; if this is the case, there needs to be further discussion of other threats. The introduction remains much longer than is typical, and some editing needs to be considered.

Response 1: Thank you for pointing this out. We removed text mentioning 'adult' content from the Introduction almost completely. Also, additional possible filtering categories were mentioned in the text.

Comments 2: The discussion around delay is improved, but there needs to be some further evaluation of this as part of the prototype/shortcomings of the work. There is no detail as to how non-HTML/text URLs would be skipped, cached, etc.; in this sense the work is fairly limited (although successful) and these points need further discussion.

Comments 3: Regarding dealing with HTTPS, there is mention of domain filtering, which is effectively out of scope of what this system is trying to prove possible, but certainly part of a more complete solution. Domain extraction is not always straightforward – e.g. SSL proxies are just one bypass means. DoH/DoT will require appropriate firewall configuration, although this again becomes a challenge of locking all ports down, and it is very difficult to prevent DoH running on 443/tcp. There is the discussion of this being used as a parental control; bypassing the suggested approaches is not likely to be difficult for many teenagers 😉

Response 2 and 3: Agree. We have, accordingly, revised the Discussion section and described the limitations of our study.

Comments 4: Specific comments:

Lines 36-59 – review carefully to de-emphasise adult content and include other common threat areas and undesirable content.

  • Done.

Line 80 – the citations simply link to product pages; how were these determined to be the most popular? Consider rewording as “popular applications include...” (and also consider URLs as footnotes or inline?)

  • Done.

Line 336 – can the quality of Figure 3 be improved?

  • All figures were revised, increasing their resolution.

Line 572 – typically there is no space before the “s” unit. Be consistent; on the following page, seconds is written out in full.

  • The units were made consistent in that section.

Comments 5: References

[9-13] could these not be better placed as footnotes to the products when mentioned?

[22] make doi consistent with other cases

[31] a DOI is available if you export the citation as BibTeX; this is significantly shorter – 10.6084/m9.figshare.19406693.v5

[33] this just points to a homepage, could be done as a footnote.

Response 5: References were updated.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have already addressed the previous round’s comments. I do not have further comments. 

Author Response

Dear Reviewer,

Thank you very much for taking the time to review this manuscript and for your valuable comments that allowed us to improve the article.

Best Regards,

Authors
