Early Fault Detection in a Real Scenario of Hybrid Fiber–Coaxial Networks Using Machine Learning: An Approach Based on Decision Trees and Random Forests

Christian Szcerba; Enrique Dávalos; Ariel Leiva; Juan Pinto-Ríos

doi:10.3390/app151910442

¹ Polytechnic Faculty, Universidad Nacional de Asunción, Campus Universitario UNA, San Lorenzo 111421, Paraguay

² School of Electrical Engineering, Pontificia Universidad Católica de Valparaíso, Av. Brasil 2950, Valparaíso 2340000, Chile

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(19), 10442; https://doi.org/10.3390/app151910442

This article belongs to the Topic Machine Learning in Communication Systems and Networks, 2nd Edition

Abstract

Cable service providers face significant challenges in managing Hybrid Fiber–Coaxial (HFC) networks due to the growing demand for high-speed services. Ensuring high service availability is critical to preventing customer attrition. This study employs machine learning techniques, specifically Decision Tree and Random Forest models, for proactive fault detection in HFC networks using data from the Simple Network Management Protocol (SNMP). Two operational scenarios were considered: a network-wide model and node-specific models. The dataset for fault detection exhibited a severe class imbalance, with outage events being extremely rare. To address this, the Synthetic Minority Oversampling Technique (SMOTE), which generates synthetic samples of the minority class to balance the dataset, was applied. This significantly improved recall and F1-scores—the harmonic mean of precision and recall—while maintaining high precision. The results demonstrate that these machine learning algorithms achieve up to 98% accuracy, and the SMOTE-enhanced models provide more reliable detection of connectivity faults. This approach is highly effective for cable operators in maintaining quality of service, enabling proactive management of problems and enhancement of network performance.

Keywords: Data Over Cable Service Interface Specification; Hybrid Fiber–Coaxial; decision tree; random forest; machine learning

1. Introduction

Despite the growing adoption of fiber optic services globally [1], many Internet Service Providers (ISPs) still rely on DOCSIS (Data Over Cable Service Interface Specification) standards [2] over Hybrid Fiber–Coaxial (HFC) networks for high-quality broadcasts and Internet services [3]. Maintaining high service reliability is essential to mitigating customer attrition, and the implementation of proactive fault-prevention strategies can play a pivotal role in enhancing overall user experience (UX).

Aging infrastructures, cable corrosion, and network heterogeneity further complicate operations [4], and HFC networks require continuous upgrades to maintain service quality [5]. Upstream link problems caused by interference can degrade performance, often requiring labor-intensive manual interventions that disrupt services [6]. In this context, machine learning (ML) offers a promising approach by detecting early signs of faults from large volumes of network data. Supervised learning algorithms such as Decision Trees and Random Forests are particularly effective for this task [7].

This study examines two complementary scenarios for early fault detection in HFC networks using machine learning. Scenario 1 adopts a network-wide perspective, aggregating data across the entire network to identify global patterns, although with inherent limitations in isolating specific faults. In this scenario, due to the nature of the data acquisition, the target variable representing service outages is highly imbalanced. To address this issue, the Synthetic Minority Oversampling Technique (SMOTE) [8] was applied to generate synthetic samples of the minority class, thereby improving the robustness of the classification models. Scenario 2 conducts analysis at the node level, enabling more granular and sensitive detection of individual faults. This dual approach provides a comprehensive framework for evaluating the trade-offs and benefits associated with different levels of monitoring granularity in HFC networks. The HFC network analyzed has been built progressively over more than three decades, resulting in a heterogeneous infrastructure with components from different technological generations and natural wear and tear. Furthermore, optimal signal levels vary by geographic area, influenced by factors such as topography, subscriber density, and environmental conditions.

This article investigates the application of machine learning models for early fault detection in a real Hybrid Fiber–Coaxial (HFC) network operated by a telecommunications provider. The study leverages data collected through the Simple Network Management Protocol (SNMP) monitoring system. Using this real-world dataset, two fault detection models were developed: (i) a general model covering the entire network and (ii) a node-specific model targeting localized service issues. Both models, implemented using Decision Tree and Random Forest algorithms, are designed to improve the accuracy and efficiency of fault detection. The study further compares models trained on the original dataset and on the SMOTE-balanced dataset to evaluate the impact of resampling on outage detection performance.

The methodology comprises dataset creation, data preprocessing, model training, and performance evaluation. Section 2 presents a review of previous work on signal-level monitoring tools and the application of machine learning (ML) in HFC networks. Section 3 defines the research problem, while Section 4 details the conceptual framework, including network subdivision, DOCSIS standards, and ML algorithms. Section 5 outlines the methodology in depth, including business understanding, data preprocessing, and model performance assessment, and Section 6 presents the conclusions.

2. Bibliographical Analysis

Proactive fault detection is essential for ensuring the performance and reliability of HFC networks, and ML is increasingly being explored as a viable solution. Several studies have identified network monitoring variables capable of predicting faults. For example, ref. [9] highlights the potential of ML in automating cable modem data collection and analysis. In ref. [10], Decision Trees are applied to detect radio frequency behaviors in Long-Term Evolution (LTE) networks, offering insights relevant for HFC data analysis. Similarly, ref. [11] employs DOCSIS-based tools to visualize key network parameters, aiding in fault detection.

Research specifically targeting HFC networks has demonstrated promising results. For instance, ref. [12] uses Full-Band Capture (FBC) data to cluster cable modems based on signal patterns, enabling proactive repairs of weak network segments. Unsupervised techniques like k-means and Gaussian mixture models have also been employed for anomaly detection [13], with principal component analysis improving accuracy. Other works, such as [14], propose big data platforms for fault detection, although adjustments are needed for non-tree topologies. Active monitoring approaches using pattern recognition have also proven effective in anomaly detection, with both FBC [12] and big data frameworks [14] serving as key enablers.

This review underscores the need for research tailored specifically to HFC networks. Our approach addresses this gap by applying ML to early fault detection using signal-level data, thereby advancing both fault detection and localization techniques for HFC network environments.

3. Problem Description

Cable service providers face significant challenges in managing Hybrid Fiber–Coaxial (HFC) networks, particularly due to the technical heterogeneity arising from network mergers. Older coaxial cables are especially susceptible to corrosion, which can lead to a degradation in connection quality [4].

The deployed network architecture employs distinct channels for upstream and downstream signal transmission. Upstream channels, which carry data from customer premises to network nodes, are particularly vulnerable to a phenomenon known as noise channeling, in which interference originating from a single household can propagate and disrupt multiple connections [6], thereby compromising uplink reliability. In operational practice, field technicians often use a binary search approach to locate malfunctioning devices, a process that involves disconnecting amplifiers and temporarily interrupting service during troubleshooting.

Proactive Network Maintenance (PNM) tools can assist in diagnosing network problems; however, they frequently produce excessive alarms, potentially overwhelming technicians [5]. Consequently, predicting failures prior to field intervention is critical to preventing service disruptions, underscoring the importance of predictive network management.

4. Conceptual Framework

This section outlines the conceptual framework adopted in this study, including the subdivision of the HFC network, the DOCSIS standard, machine learning, Decision Trees, and Random Forest.

4.1. HFC Network

Cable operators initially introduced high-speed data services over existing television infrastructure, enabling bidirectional IP traffic over HFC networks. Bandwidth demand has since increased significantly with the rise of symmetric data applications [2].

An HFC network is typically divided into two main sections: (i) the optical segment, which connects the central office to nodes via fiber optics, and (ii) the coaxial segment, which employs line amplifiers and splitters to deliver service to end users. Signals originate at the master headend and pass through the Cable Modem Termination System (CMTS) before reaching optical nodes, where they are converted to Radio Frequency (RF) for coaxial transmission [12]. This architecture facilitates fault detection, particularly within the segment between the CMTS and the optical nodes.

As illustrated in Figure 1, fiber nodes connect the network to customer cable modems, with amplifiers extending coverage to the final mile. Faults occurring in shared upstream channels can impact all devices connected to a given node.

Figure 1. Overview of the Hybrid Fiber–Coaxial (HFC) network architecture, showing the hierarchical structure and key data collection points used for SNMP-based fault detection [2].

4.2. DOCSIS

The development of HFC networks enabled broadband Internet access through DOCSIS, a set of standards ensuring high-speed data transmission over cable networks. Developed by CableLabs, DOCSIS supports a variety of services, including Internet access, high-definition television (HDTV), Voice over IP (VoIP), the Internet of Things (IoT), and Virtual Reality (VR) [2].

DOCSIS has evolved to address increasing bandwidth demands, with DOCSIS 3.1 introducing Orthogonal Frequency-Division Multiplexing (OFDM) for downstream and Orthogonal Frequency-Division Multiple Access (OFDMA) for upstream transmissions [5]. However, upstream links remain susceptible to noise and distortion [10], which can disrupt service for all modems within a fiber-node area [6]. The DOCSIS Full Band Capture (FBC) feature [12] enables remote spectrum analysis via the Simple Network Management Protocol (SNMP), offering cost advantages over traditional field analyzers. SNMP monitoring captures signal-level data from customer modems, including transmission and reception (TX/RX) power, which must remain within DOCSIS-defined thresholds to avoid communication issues. Additional performance metrics—such as error counts, corrected and uncorrected packets, Signal-to-Noise Ratios (SNRs), and system uptime—are also recorded to support automated analysis and troubleshooting.
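
As a rough illustration of the per-modem record described above, the following sketch models one SNMP poll and lists the fields that fall outside an acceptable window. The field names and every numeric limit are placeholders invented for this example; they are not taken from the DOCSIS specification or from the operator's monitoring system.

```python
from dataclasses import dataclass

# Hypothetical limits for illustration only. Real limits come from the
# DOCSIS specification and the operator's engineering guidelines.
ASSUMED_LIMITS = {
    "rx_power_dbmv": (-15.0, 15.0),
    "tx_power_dbmv": (30.0, 55.0),
    "snr_db": (30.0, float("inf")),
}


@dataclass
class ModemReading:
    """One SNMP poll of a cable modem (field names are assumptions)."""
    mac: str
    timestamp: str
    rx_power_dbmv: float
    tx_power_dbmv: float
    snr_db: float
    corrected: int
    uncorrected: int
    uptime_s: int


def out_of_range(reading):
    """Return the names of the fields that fall outside the assumed limits."""
    flagged = []
    for name, (low, high) in ASSUMED_LIMITS.items():
        value = getattr(reading, name)
        if not (low <= value <= high):
            flagged.append(name)
    return flagged


# Made-up reading: low receive power and low SNR, transmit power in range.
sample = ModemReading("00:11:22:33:44:55", "14:29", -18.2, 48.0, 27.5, 1200, 35, 86400)
print(out_of_range(sample))
```

Keeping the limits in a single table makes it easy to swap in per-area values, which matters here because optimal signal levels vary by geographic area.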

4.3. Machine Learning

Machine learning (ML), a branch of Artificial Intelligence (AI) [15], is a powerful approach for analyzing large datasets and generating predictions for various variables [10]. It has been widely applied in domains such as medicine, technology, and telecommunications [16]. ML applications can generally be categorized into regression, classification, and decision-making tasks [17], with classification being the primary focus in this study for fault detection. ML has demonstrated notable success in industrial processes [18] and optical networks [19]. The main learning paradigms include supervised learning, which uses labeled data for prediction [20]; unsupervised learning; and reinforcement learning [21].

This study applies supervised learning techniques, comparing Decision Tree and Random Forest models for early fault detection by analyzing network variables.

4.4. Decision Trees

Achieving real-time fault detection requires the adoption of ML techniques capable of delivering rapid and accurate classification responses. Decision Trees (DTs) have been proposed for this purpose. DTs classify instances by evaluating feature values, where each internal node represents a feature and the branches denote possible outcomes [20]. A Decision Tree consists of a root node, intermediate decision nodes, and leaf nodes, which correspond to predicted class labels. This structure facilitates classification and prediction based on input variables and features [20].
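
To make the node-splitting idea concrete, the sketch below searches for the threshold on a single feature that best separates two classes. It assumes Gini impurity as the split criterion, which is a common default but is not stated in this study, and the SNR readings and labels are made up.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Find the threshold on one feature that minimises weighted Gini impurity."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Try the midpoint between each pair of consecutive distinct values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2.0
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up SNR readings (dB) and outage labels, for illustration only.
snr = [36.0, 35.5, 34.8, 36.2, 22.1, 20.4, 35.9, 21.7]
outage = [0, 0, 0, 0, 1, 1, 0, 1]
threshold, impurity = best_threshold(snr, outage)
print(threshold, impurity)
```

A full tree repeats this search over every feature and then recurses into the two resulting subsets until a stopping rule is met.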

4.5. Random Forest

Random Forests build multiple Decision Trees using bagging: each tree is trained on a bootstrap sample of the data, and a random subset of features is considered at each node split. This ensemble approach enhances accuracy by aggregating predictions from multiple trees. However, determining the optimal number of trees often requires empirical tuning, as increasing the number does not always yield significant performance gains [22]. Nonetheless, the number of trees is a simple parameter that can be readily adjusted to optimize performance.
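
The ensemble mechanism can be sketched in a few lines. The example below is a deliberately simplified stand-in for a Random Forest: each "tree" is a one-split stump, trained on a bootstrap sample and restricted to one randomly chosen feature, and the forest predicts by majority vote. The function names, the stump simplification, and the data are all assumptions made for illustration; a real study would normally rely on an established library implementation.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a one-split tree on a single feature by minimising training errors."""
    values = sorted({row[feature] for row in rows})
    if len(set(labels)) == 1 or len(values) == 1:
        # Nothing to split on: always predict the majority label.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority)
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2.0
        for below in (0, 1):  # label predicted when value <= threshold
            errors = sum(
                (below if row[feature] <= threshold else 1 - below) != y
                for row, y in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below)
    return (feature, best[1], best[2])


def predict_stump(stump, row):
    feature, threshold, below = stump
    return below if row[feature] <= threshold else 1 - below


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)             # random feature choice
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict_forest(forest, row):
    """Aggregate the individual votes by simple majority."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Made-up (SNR dB, TX power dBmV) readings and outage labels.
rows = [(36.0, 45.0), (35.5, 44.0), (34.8, 46.0), (36.2, 45.5),
        (22.1, 54.0), (20.4, 55.0), (35.9, 44.5), (21.7, 53.5)]
labels = [0, 0, 0, 0, 1, 1, 0, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (21.0, 54.5)), predict_forest(forest, (35.0, 45.0)))
```

The `n_trees` argument is the knob discussed above: it can be raised or lowered and the effect on validation metrics observed directly.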

5. Methodology for Early Fault Detection in HFC Networks Using ML

To develop a machine learning-based fault detection system, the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework [23] was adopted. This methodology provides a structured yet flexible approach to data mining. Figure 2 summarizes the adaptation of CRISP-DM to the design of the proposed ML-based fault detector. The process comprises the following stages: (I) business understanding, (II) data understanding, (III) data preparation, (IV) modeling, and (V) evaluation. The deployment stage is not addressed in this study, as deployment conditions are beyond its scope.

Figure 2. Methodology schematic outlining the five stages of the CRISP-DM framework applied: business understanding, data understanding, data preparation, modeling, and evaluation. Deployment is not covered in this study.

In constructing the dataset for model training, the CRISP-DM methodology was applied as follows:

5.1. Business Understanding

In this phase, key operational factors in the DOCSIS network are identified from the perspective of the telecommunications operator. The objective is to determine the most relevant variables associated with connection failures and to establish how available data can support their detection.

Each cable modem is assigned a service IP address via DHCP for internet access and a management IP for remote administration through the Simple Network Management Protocol (SNMP), which enables signal monitoring and configuration changes. Leveraging the DOCSIS Management Information Base (MIB) structure [24], data such as MAC addresses, RX/TX power levels, Signal-to-Noise Ratios (SNRs), and system uptime are collected periodically.

5.2. Data Understanding

This stage defines the study scenario and determines key factors for identifying patterns in network failures, based on the variables established in the previous phase. The goal is to collect and explore relevant data, retaining patterns of interest while discarding irrelevant information.

To detect outages, logs from modems associated with affected nodes were captured. Fault flags were incorporated into the SNMP dataset, enabling identification of modems and signal readings corresponding to periods of failure.

Figure 3 presents an example of signal levels recorded for a cable modem, highlighting an outage period beginning at the timestamp 14:29. The horizontal axis represents the measurement timestamps, while the left vertical axis indicates the number of corrected packets and the right vertical axis shows the corresponding signal levels.

Figure 3. Example of signal-level measurements showing their behavior before and during a node outage event, highlighting key SNMP metrics relevant for fault detection.
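
One way the fault flags could be attached to the SNMP readings is sketched below: a reading is labelled as an outage when its node appears in an outage log and its timestamp falls inside the logged window. The record layout, the field names, and the data are hypothetical; only the 14:29 start time echoes the example in Figure 3.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"


def parse(ts):
    """Parse a timestamp string in the assumed log format."""
    return datetime.strptime(ts, TIME_FORMAT)


def add_outage_flag(readings, outages):
    """Attach outage=1 to readings whose node was down at the reading time."""
    windows = [(o["node"], parse(o["start"]), parse(o["end"])) for o in outages]
    labelled = []
    for r in readings:
        t = parse(r["timestamp"])
        down = any(
            node == r["node"] and start <= t <= end
            for node, start, end in windows
        )
        labelled.append({**r, "outage": int(down)})
    return labelled


# Made-up outage log and SNMP readings, for illustration only.
outages = [{"node": "N1", "start": "2024-01-01 14:29", "end": "2024-01-01 15:10"}]
readings = [
    {"mac": "aa", "node": "N1", "timestamp": "2024-01-01 14:20", "snr_db": 35.8},
    {"mac": "aa", "node": "N1", "timestamp": "2024-01-01 14:30", "snr_db": 21.9},
    {"mac": "bb", "node": "N2", "timestamp": "2024-01-01 14:30", "snr_db": 36.1},
]
labelled = add_outage_flag(readings, outages)
print([r["outage"] for r in labelled])
```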

In Scenario 1, the dataset exhibits a significant imbalance between the majority class (normal operation) and the minority class (outage). To mitigate this issue, the Synthetic Minority Oversampling Technique (SMOTE) was employed to generate synthetic samples of the minority class, thereby improving the capacity of machine learning models to learn patterns related to rare outage events [8].

Given the large volume of data collected by monitoring systems, not all records can be directly associated with the outage event. For the purposes of this study, only the variables exhibiting the strongest correlation with the failure were considered as key factors, as summarized in Figure 4. A correlation analysis was performed to validate the selection of these features.

Figure 4. Key HFC network factors for the database compiled during the business understanding phase.
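
The correlation analysis mentioned above could take the following form. The study does not state which coefficient was used, so this sketch assumes the Pearson coefficient between each numeric variable and the binary outage flag, and ranks variables by absolute value. The column names and values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sd_x * sd_y)


def rank_features(columns, flag):
    """Sort feature names by absolute correlation with the outage flag."""
    scores = {name: pearson(values, flag) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up columns: SNR drops during outages, uptime is unrelated.
columns = {
    "uptime_h": [10, 50, 30, 70, 20, 60, 40, 80],
    "snr_db": [36.0, 35.5, 34.8, 36.2, 22.1, 20.4, 35.9, 21.7],
}
flag = [0, 0, 0, 0, 1, 1, 0, 1]
ranked = rank_features(columns, flag)
print(ranked)
```

Variables at the top of such a ranking would be the candidates retained as key factors; the cut-off itself is a judgment call that this sketch does not attempt to make.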

5.3. Data Preparation

This stage involves preprocessing the dataset by extracting key features, handling missing values, and removing irrelevant attributes. Techniques such as normalization, scaling, and encoding are applied to ensure data consistency. The dataset is subsequently divided into training and testing subsets, following a common split ratio of 70% for training and 30% for evaluation.
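
A 70/30 split can be sketched as follows. The version below splits each class separately (a stratified split) so that the rare outage class is represented in both subsets; the study specifies only the 70/30 ratio, so stratification and the fixed seed are assumptions of this example.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class at the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx += indices[:cut]
        test_idx += indices[cut:]
    return sorted(train_idx), sorted(test_idx)


# Made-up labels: 90 normal samples and 10 outages.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```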

Given the severe class imbalance in the dataset, with outage events being extremely rare, SMOTE was considered to generate synthetic samples of the minority class. This approach enhances the ability of the models to detect rare failures, improving recall and F1-scores while maintaining precision [25]. Preprocessing also includes filtering relevant variables, eliminating erroneous data, and adding features to enhance learning [10]. This stage is critical for the development of machine learning models, as the quality of input data directly influences prediction accuracy. Raw datasets often contain noise, outliers, and anomalies that can degrade performance. To address this, a flag column is introduced to identify records within the outage threshold, and invalid records are removed before the final split into training and evaluation sets.
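
The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched as below. This is a simplified illustration rather than the implementation used in the study: base samples are drawn at random instead of being visited in turn, and the values of `k`, the seed, and the data points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class tuples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Made-up outage samples as (SNR dB, TX power dBmV).
minority = [(22.1, 54.0), (20.4, 55.0), (21.7, 53.5)]
new_points = smote(minority, n_new=5, k=2)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the outage class rather than duplicating existing records.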

5.4. Modeling

In this stage, machine learning (ML) models for failure prediction were developed using the prepared dataset. Classification techniques—encompassing algorithm selection, hyperparameter tuning, and cross-validation—were applied to mitigate overfitting and balance model complexity for improved generalization. Model performance was optimized using evaluation metrics such as accuracy, recall, and F1-score [16].
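
Cross-validation, as mentioned above, repeatedly holds out one part of the training data for validation. The sketch below generates the index sets for a plain k-fold scheme; the number of folds is not reported in the study, so `k=5` is an assumption, and with data as imbalanced as this a shuffled or stratified variant would usually be preferable.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, k=3))
for train_idx, val_idx in folds:
    print(val_idx)
```

Each candidate hyperparameter setting would be trained on every `train_idx` and scored on the matching `val_idx`, with the scores averaged across folds.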

Decision Tree and Random Forest models were trained using binary labels (“non-outage” and “outage”). Two data collection scenarios were considered: (i) Scenario 1, which represents normal network operation using aggregated data from all nodes, and (ii) Scenario 2, which focuses on network failures by including only data from the affected node. Four models were thus trained: two Decision Tree models and two Random Forest models, one for each scenario.

For Scenario 1, each classifier was trained and evaluated twice: once using the original imbalanced dataset, and once using the SMOTE-balanced dataset. This comparison allowed the identification of the best-performing model in terms of its ability to detect outage events despite the imbalance. Scenario 2, by contrast, did not require resampling given its higher outage density.

The effectiveness of these models is presented in the following section.

5.5. Evaluation

In the final stage, the proposed ML models for failure prediction were evaluated against the reserved test data to validate their performance. This step ensures that the models meet the business objective of accurately predicting failures within the HFC network.

The test sets set aside during preprocessing were used to assess performance, and confusion matrices were generated for both Scenario 1 and Scenario 2. In Figure 5 and Figure 6, the left panel shows the confusion matrix for the Decision Tree model, while the right panel presents the results for the Random Forest model. Figure 7 displays the confusion matrices for both models when classifying events as “outage” or “no outage.” Each matrix is color-coded to represent true positives, false positives, false negatives, and true negatives.

Figure 5. Confusion matrices for Scenario 1 for proposed Decision Tree models.

Figure 6. Confusion matrices for Scenario 1 for proposed Random Forest models.

Figure 7. Confusion matrices for Scenario 2 for Decision Tree and Random Forest models.

This evaluation framework enables the calculation of key performance metrics such as precision, recall, and F1-score. The F1-score, defined as the harmonic mean of precision and recall, assigns equal weight to both metrics. It ranges from 0 to 1, where 1 represents perfect classification and 0 indicates the worst possible performance [20].
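
The metrics above follow directly from the confusion-matrix counts. A minimal sketch, using made-up counts, is:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Made-up counts: 8 outages caught, 2 false alarms, 2 outages missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(precision, recall, f1)
```

True negatives do not appear in any of the three formulas, which is why these metrics remain informative when the "no outage" class dominates.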

In addition to the models trained on the original dataset, SMOTE-based variants were also evaluated. These models incorporated synthetic oversampling of the minority outage class, allowing a more balanced training distribution. As a result, both Decision Tree and Random Forest classifiers trained with SMOTE achieved substantially higher recall and F1-scores compared to their original counterparts, while maintaining competitive precision.

The methodology described above was applied to a real Hybrid Fiber–Coaxial (HFC) network operating under a DOCSIS system. A total of 6,352,750 samples were collected by the monitoring system from the live network during the business understanding and data understanding phases. In Scenario 1, which incorporates signal level readings from all cable modems within a 5 h observation window, the Decision Tree model achieved high accuracy (99.98%) but low fault detection precision (19%). The Random Forest model obtained slightly higher accuracy (99.99%) but exhibited poor performance in identifying true positives, with 100% precision and very low recall (0.2%). This seemingly perfect precision is misleading, as it results from a severe class imbalance: only 0.0089% of the samples corresponded to outage events. This imbalance biased the model predictions and led to a low F1-score. By contrast, the SMOTE-enhanced Decision Tree model achieved 99.91% recall and 94.65% F1-score, while the Random Forest model obtained 99.97% recall and 94.68% F1-score, demonstrating the effectiveness of oversampling in mitigating class imbalance.
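
A quick back-of-the-envelope check helps put these figures in perspective. The sketch below uses only the values reported above and rests on two assumptions: that the 0.0089% outage share applies to the full set of 6,352,750 samples, and that each reported recall and F1-score pair was measured on the same test set. Under those assumptions it gives roughly 565 outage samples, an F1-score of about 0.4% for the original Random Forest, and an implied precision of about 90% for both SMOTE-enhanced models.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    return f1 * recall / (2 * recall - f1)


# Values reported in the text.
total_samples = 6_352_750
outage_share = 0.0089 / 100

# Assumption: the outage share refers to the full sample count.
approx_outages = total_samples * outage_share

# Original Random Forest: 100% precision, 0.2% recall.
rf_original_f1 = f1_score(1.0, 0.002)

# SMOTE-enhanced models: precision implied by the reported recall and F1.
dt_smote_precision = implied_precision(0.9465, 0.9991)
rf_smote_precision = implied_precision(0.9468, 0.9997)

print(round(approx_outages), round(rf_original_f1, 4))
print(round(dt_smote_precision, 3), round(rf_smote_precision, 3))
```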

Figure 8 summarizes the performance of the evaluated models under both scenarios, using a diverse set of metrics to capture different aspects of classification quality. Accuracy reports the overall proportion of correctly classified samples, while Precision indicates the reliability of the model in identifying outages without producing false alarms. Recall measures the ability to detect actual outage events, which is critical for minimizing undetected failures. The F1-score balances precision and recall into a single indicator, making it particularly relevant in the context of imbalanced datasets [26]. The Area Under the Curve (AUC) further evaluates the discriminatory capacity of the models across thresholds. Additionally, Cohen’s Kappa accounts for agreement beyond chance, and the Matthews Correlation Coefficient (MCC) provides a balanced assessment of prediction quality, even in the presence of class imbalance [27,28].

Figure 8. Performance metrics of Decision Tree and Random Forest classifiers on original and SMOTE-enhanced datasets, including accuracy, precision, recall, F1-score, AUC, Cohen’s Kappa, and MCC to assess classification under class imbalance.
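
To show why accuracy alone can mislead on imbalanced data, the sketch below computes accuracy, Cohen's Kappa, and MCC from the four confusion-matrix counts. The counts are made up to mimic a classifier that misses most outages; they are not results from this study.

```python
import math


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp, fp, fn, tn):
    """Agreement between predictions and labels, corrected for chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1:
        return 0.0
    return (observed - expected) / (1 - expected)


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; 0.0 when any marginal is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts: 10 outages in 10,000 samples, only one detected.
counts = dict(tp=1, fp=0, fn=9, tn=9990)
print(accuracy(**counts), cohen_kappa(**counts), mcc(**counts))
```

With these counts accuracy exceeds 99.9%, while Kappa and MCC stay well below 0.5, which is the pattern the original Scenario 1 models exhibit.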

It is important to note that the low recall observed in Scenario 1 reflects the inherent challenges of modeling fault detection at a holistic network level when working with a large, highly imbalanced dataset in which failure events are extremely rare. In contrast, Scenario 2—by focusing on a specific node—enabled the models to achieve a substantially higher detection performance by leveraging localized data patterns. Moreover, the SMOTE-enhanced models demonstrated that oversampling transforms impractical classifiers into effective tools, yielding actionable insights for early fault detection in real-world operations.

These findings emphasize a critical point: models trained on the original, highly imbalanced dataset—especially in Scenario 1—exhibited recall values so low that they rendered the approach impractical for real-world fault detection, as most failure events were systematically missed. By contrast, incorporating SMOTE transformed these same classifiers into viable solutions, markedly increasing recall and F1-scores while maintaining competitive precision. This demonstrates that oversampling techniques are not merely optional adjustments, but rather essential steps when addressing the intrinsic rarity of failure events in large-scale HFC networks.

6. Conclusions

This study investigated fault detection in Hybrid Fiber–Coaxial (HFC) networks using Decision Tree and Random Forest models, focusing on dataset preprocessing, operational scenarios, and performance evaluation. In Scenario 1, models trained on the original dataset achieved high accuracy but suffered from low recall due to a severe class imbalance, whereas Scenario 2, focused on a single node, benefited from localized patterns and yielded a higher detection performance. The application of SMOTE effectively mitigated class imbalance, substantially improving recall and F1-scores and transforming models into practical tools for real-world deployment. As future work, extending the methodology to a multiclass framework by labeling different types of failure events—such as partial degradations, intermittent outages, or complete service interruptions—would enable classifiers to provide more granular insights, supporting proactive maintenance and resource optimization. Overall, these results demonstrate that combining robust machine learning models with appropriate preprocessing techniques—such as SMOTE—bridges the gap between theoretical performance and operational applicability, laying the foundation for proactive fault management and enhanced service quality in HFC networks.

Author Contributions

Conceptualization, J.P.-R.; methodology, J.P.-R.; software, C.S.; validation, J.P.-R. and C.S.; formal analysis, J.P.-R.; investigation, J.P.-R.; resources, A.L. and E.D.; data curation, C.S.; writing—original draft preparation, J.P.-R., C.S., A.L. and E.D.; writing—review and editing, J.P.-R., C.S., A.L. and E.D.; visualization, J.P.-R. and C.S.; supervision, A.L. and E.D.; project administration, A.L. and E.D.; funding acquisition, A.L. and E.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Agencia Nacional de Investigación y Desarrollo (ANID-Chile) Doctorado Nacional, grant number 2022-21220867, and Agencia Nacional de Investigación y Desarrollo (ANID-Chile) FONDECYT REGULAR, grant number 1241362. The APC was funded by ANID.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy reasons.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. OECD. Percentage of Fibre Connections in Total Broadband. December 2023. Available online: https://www.oecd.org/en/topics/sub-issues/broadband-statistics.html (accessed on 4 October 2024).
2. CableLabs. DOCSIS 4.0 MAC and Upper Layer Protocols Interface Specification. Specifications CM-SP-MULPIv4.0, CableLabs. 2023. Available online: https://www.cablelabs.com/specifications/CM-SP-MULPIv4.0 (accessed on 13 May 2024).
3. Cisco. Cisco Annual Internet Report (2018–2023) White Paper. Cisco, 2020. Available online: https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html (accessed on 24 September 2025).
4. Ji, R.; Gao, J.; Xie, G.; Flowers, G.T.; Jin, Q. The impact of coaxial connector failures on high frequency signal transmission. In Proceedings of the 2015 IEEE 61st Holm Conference on Electrical Contacts (Holm), San Diego, CA, USA, 11–14 October 2015; pp. 298–303.
5. CableLabs. PNM Best Practices Primer: HFC Networks (DOCSIS 3.1). Specifications CM-GL-PNM-3.1-V01-200506, CableLabs. 2024. Available online: https://www.scribd.com/document/563981463/CM-GL-PNM-3-1-V01-200506-3 (accessed on 13 May 2024).
6. Williams, T.H.; Hunter, D. Isolating an Upstream Noise Source in a Cable Television Network. U.S. Patent 9,729,257, 8 August 2017.
7. Mahmoud, H.H.H.; Ismail, T. A review of machine learning use-cases in telecommunication industry in the 5G Era. In Proceedings of the 2020 16th International Computer Engineering Conference (ICENCO), Cairo, Egypt, 29–30 December 2020; pp. 159–163.
8. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
9. Abedin, S.; Ben Ghorbel, M.; Hossain, M.J.; Berscheid, B.; Howlett, C. A Novel Approach for Profile Optimization in DOCSIS 3.1 Networks Exploiting Traffic Information. IEEE Trans. Netw. Serv. Manag. 2019, 16, 578–590.
10. Villamar, V.; Rocha, C.; Navarrete, H.; Lupera-Morillo, P. A Predictive Handover Approach in LTE Networks with Measurements and Decision Tree Algorithms (Case Study City of Quito). Rev. Politécnica 2023, 52, 15–24.
11. Benhavan, T.; Songwatana, K. HFC network performance monitoring system using DOCSIS cable modem operation data in a 3 dimensional analysis. In Proceedings of the 4th Joint International Conference on Information and Communication Technology, Electronic and Electrical Engineering (JICTEE), Chiang Rai, Thailand, 5–8 March 2014; pp. 1–5.
12. Gibellini, E.; Righetti, C.E. Unsupervised Learning for Detection of Leakage from the HFC Network. In Proceedings of the 2018 ITU Kaleidoscope: Machine Learning for a 5G Future (ITU K), Santa Fe, Argentina, 26–28 November 2018; pp. 1–8.
13. Millicom. Millicom Earnings Release Q1 2024; Technical Report; Millicom: Luxembourg, 2024.
14. Simakovic, M.; Cica, Z. Detection and localization of failures in hybrid fiber–coaxial network using big data platform. Electronics 2021, 10, 2906.
15. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260.
16. Choudhary, R.; Gianey, H.K. Comprehensive review on supervised machine learning algorithms. In Proceedings of the 2017 International Conference on Machine Learning and Data Science (MLDS), Noida, India, 14–15 December 2017; pp. 37–43.
17. Dargan, S.; Kumar, M.; Ayyagari, M.R.; Kumar, G. A Survey of Deep Learning and Its Applications: A New Paradigm to Machine Learning. Arch. Comput. Methods Eng. 2020, 27, 1071–1092.
18. Lei, Y.; Yang, B.; Jiang, X.; Jia, F.; Li, N.; Nandi, A.K. Applications of machine learning to machine fault diagnosis: A review and roadmap. Mech. Syst. Signal Process. 2020, 138, 106587.
19. Wang, D.; Zhang, C.; Chen, W.; Yang, H.; Zhang, M.; Lau, A.P.T. A review of machine learning-based failure management in optical networks. Sci. China Inf. Sci. 2022, 65, 211302.
20. Kotsiantis, S.B.; Zaharakis, I.; Pintelas, P. Supervised machine learning: A review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 2007, 160, 3–24.
21. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
22. Oshiro, T.M.; Perez, P.S.; Baranauskas, J.A. How many trees in a random forest? In Machine Learning and Data Mining in Pattern Recognition, Proceedings of the 8th International Conference, MLDM 2012, Berlin, Germany, 13–20 July 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 154–168.
23. Elkabalawy, M.; Al-Sakkaf, A.; Abdelkader, E.M.; Alfalah, G. CRISP-DM-Based Data-Driven Approach for Building Energy Prediction Utilizing Indoor and Environmental Factors. Sustainability 2024, 16, 7249.
24. Woundy, R.; Marez, K. Cable Device Management Information Base for Data-Over-Cable Service Interface Specification (DOCSIS) Compliant Cable Modems and Cable Modem Termination Systems. RFC 4639, Internet Engineering Task Force (IETF), December 2006. Available online: https://www.rfc-editor.org/info/rfc4639 (accessed on 24 September 2025).
25. Matharaarachchi, S.; Domaratzki, M.; Muthukumarana, S. Enhancing SMOTE for imbalanced data with abnormal minority instances. Mach. Learn. Appl. 2024, 18, 100597.
26. Williams, C.K.I. The Effect of Class Imbalance on Precision-Recall Curves. Neural Comput. 2021, 33, 853–857.
27. Chicco, D.; Tötsch, N.; Jurman, G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Mining 2021, 14, 13.
28. Chicco, D.; Jurman, G. A statistical comparison between Matthews correlation coefficient (MCC), prevalence threshold, and Fowlkes–Mallows index. J. Biomed. Inform. 2023, 132, 104426.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
